Abstract — This paper proposes a new notion of typical 
sequences on a wide class of abstract alphabets (so-called 
standard Borel spaces), which is based on approximations of 
memoryless sources by empirical distributions uniformly over 
a class of measurable "test functions." In the finite-alphabet 
case, we can take all uniformly bounded functions and recover 
the usual notion of strong typicality (or typicality under the 
total variation distance). For a general alphabet, however, this 
function class turns out to be too large, and must be restricted. 
With this in mind, we define typicality with respect to any 
Glivenko-Cantelli function class (i.e., a function class that 
admits a Uniform Law of Large Numbers) and demonstrate its 
power by giving simple derivations of the fundamental limits 
on the achievable rates in several source coding scenarios, in 
which the relevant operational criteria pertain to reproducing 
empirical averages of a general-alphabet stationary memoryless 
source with respect to a suitable function class. 

Index Terms — Coordination via communication, empirical 
processes, Glivenko-Cantelli classes, rate distortion, source 
coding, standard Borel spaces, typical sequences, uniform laws 
of large numbers. 



I. Introduction 

The notion of typical sequence has been central to infor- 
mation theory since Shannon's original paper [1]. For finite 
alphabets, it leads to simple and intuitive proofs of achiev- 
ability in a wide variety of source and channel coding set- 
tings, including multiterminal scenarios [2]. Another appealing 
aspect of typical sequences is that they provide a language 
for approximation of information sources in total variation 
distance using finite communication resources. Recent work 
of Cuff et al. [3] on coordination via communication serves as 
a particularly striking example of the power of this language. 

For abstract alphabets, however, most of this power is lost; 
while such results as the asymptotic equipartition property 
carry over [4], in most other situations, particularly involving 
lossy codes, one has to resort to ergodic theory [5] or large 
deviations theory [6]. Direct approximation of abstract mem- 
oryless sources in total variation using empirical distributions 
is, in general, impossible (cf. Sec. IV for details). However, it 
is precisely this direct approximation that renders typicality- 
based proofs of achievability so transparent. 

Fig. 1. Empirical coordination of actions in a two-node network.
Node A (resp., B) observes a random n-tuple $X_A^n$ (resp., $X_B^n$), where
$(X_{A,1}, X_{B,1}), \ldots, (X_{A,n}, X_{B,n})$ are i.i.d. pairs of correlated
random variables. A message is sent from Node A to Node B at rate $R$ to
specify the n-tuple $U^n$.

The present paper makes two contributions. First, we propose
a way to revise the notion of typicality for general
alphabets (more specifically, standard Borel spaces [7], [8]),
allowing for similarly transparent achievability arguments.
When two probability measures are close in total variation, 
the corresponding expectations of any bounded measurable 
function are also close. For general alphabets, when one of 
the measures is discrete, this is too much to ask. Instead, we 
advocate an approach based on suitably restricting the class of 
functions on which we would like to match statistical expec- 
tations with sample (empirical) averages. Provided the Law of 
Large Numbers holds uniformly over the restricted function 
class, we can speak of typical sequences with respect to this 
class and develop typicality-based achievability arguments in 
close parallel to the finite-alphabet case. The central object of 
study is the empirical process [9]-[11] indexed by the function 
class, which gives information on the deviation of empirical 
means from statistical means for a given realization of the 
source under consideration, and the total variation distance is 
replaced by the supremum norm of this empirical process. 

The second contribution consists of applying our new notion 
of typicality to several source coding problems which, follow- 
ing the terminology of [3], can be thought of as "empirical 
coordination" of actions in a two-node network. Roughly 
speaking, the objective is to use communication resources in 
order to reproduce (or approximate) the empirical distribution 
of a given source sequence, rather than the sequence itself, 
with or without side information. This coordination viewpoint 
suggests a new operational framework suitable for problems 
involving distributed learning, control, and sensing. 

A. Preview of the results 

Consider the two-node network shown in Figure 1. There is
an alphabet $\mathsf{X}_A$ associated with Node A, and two alphabets,
$\mathsf{X}_B$ and $\mathsf{U}$, associated with Node B. Initially, Node A
(resp., Node B) observes a random n-tuple $X_A^n \in \mathsf{X}_A^n$ (resp.,
$X_B^n \in \mathsf{X}_B^n$), where the pairs
$(X_{A,1}, X_{B,1}), \ldots, (X_{A,n}, X_{B,n})$ are i.i.d. draws from some
specified probability law $P_{X_A X_B}$ on $\mathsf{X}_A \times \mathsf{X}_B$.
We also have a target conditional probability law $P_{U|X_A X_B}$ on
$\mathsf{U}$ given $\mathsf{X}_A$ and $\mathsf{X}_B$. Node A, given its
knowledge of $X_A^n$, $P_{X_A X_B}$, and $P_{U|X_A X_B}$, communicates
some information $M$ to Node B at rate $R$. The latter receives
$M$ and, using its knowledge of $X_B^n$, $P_{X_A X_B}$, and $P_{U|X_A X_B}$,
generates an n-tuple $U^n \in \mathsf{U}^n$.

Now imagine that there is an external observer with access
to $X_\bullet^n$ (where $\bullet$ is either $A$ or $B$) and $U^n$, who also
knows $P_{X_A X_B}$ and $P_{U|X_A X_B}$. This observer has a collection
$\mathcal{F}$ of "test functions" $f : \mathsf{X}_\bullet \times \mathsf{U} \to [-1, 1]$
and can compute the empirical expectation (or sample average)
$n^{-1} \sum_{i=1}^n f(X_{\bullet,i}, U_i)$ and the "true" expectation
$\mathbf{E} f(X_\bullet, U)$ w.r.t. the joint law
$P_{X_A X_B U} = P_{X_A X_B} \otimes P_{U|X_A X_B}$ for any $f \in \mathcal{F}$.
We assume that Nodes A and B know $\mathcal{F}$, but do not know
which $f \in \mathcal{F}$ the observer will pick. The objective is then
to minimize the expected worst-case deviation between the
empirical expectations and the true expectations:



$$ \text{minimize} \quad \mathbf{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(X_{\bullet,i}, U_i) - \mathbf{E} f(X_\bullet, U) \right| $$

over all admissible encoding and decoding strategies given
the rate constraint $R$ and the information patterns at the two
nodes (i.e., which node knows what). In other words, the goal
is to ensure that, from the observer's viewpoint, the empirical
distribution of $\{(X_{\bullet,i}, U_i)\}_{i=1}^n$ is as close as possible to the
target distribution $P_{X_\bullet U}$ in the sense that the corresponding
expectations of all $f \in \mathcal{F}$ are as close as possible, uniformly
over $\mathcal{F}$. Operational criteria of this kind arise, e.g., in the
context of statistical learning from random samples [12],
[13], where the functions in $\mathcal{F}$ may be viewed as candidate
predictors of $U$ given $X_\bullet$.

In this paper, we consider two special cases of this set-up:

1) Given two alphabets $\mathsf{X}$ and $\mathsf{Y}$, we take $\mathsf{X}_A = \mathsf{X}$,
$\mathsf{X}_B = \varnothing$, $\mathsf{U} = \mathsf{Y}$, $\bullet = A$. This is a generalization of
the basic two-node empirical coordination problem [3,
Section III.C] to abstract alphabets. (A related problem,
though with a slightly different operational criterion, is
lossy source coding with respect to a family of distortion
measures [14].)

2) We have $\mathsf{X}$ and $\mathsf{Y}$ as above, but now $\mathsf{X}_A = \mathsf{U} = \mathsf{X}$,
$\mathsf{X}_B = \mathsf{Y}$, and $\bullet = B$. Moreover,
$P_{U|X_A X_B} = P_{U|X_B} = P_{X_A|X_B}$.¹ This is a generalization of the problem of
communication of probability distributions [15] to abstract
alphabets, where we also allow side information
at the decoder (Node B).

Our achievability results hinge on the assumption that the
function class $\mathcal{F}$ admits the Uniform Law of Large Numbers
(ULLN). Given an abstract alphabet $\mathsf{Z}$, we say that a class $\mathcal{F}$ of
functions $f : \mathsf{Z} \to [-1, 1]$ admits the ULLN if the following
holds: for any i.i.d. random process $\{Z_i\}_{i=1}^\infty$ over $\mathsf{Z}$, we have

$$ \lim_{n \to \infty} \mathbf{E} \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(Z_i) - \mathbf{E} f(Z_1) \right| = 0. \tag{1} $$

¹More precisely, we require that, under $P_{X_A X_B U}$, $X_A$ and $U$ are
conditionally i.i.d. given $X_B$.
The quantity inside the $|\cdot|$ is referred to as the empirical
process associated with $Z^n$, and describes the fluctuations of
the sample mean of each $f$ around its expectation. We define
an n-tuple $z^n = (z_1, \ldots, z_n) \in \mathsf{Z}^n$ to be $\varepsilon$-typical w.r.t. $\mathcal{F}$ for
a probability law $P$ if

$$ \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(z_i) - \mathbf{E}_P f(Z) \right| < \varepsilon. \tag{2} $$
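
As a rough illustration of this definition, the following Python sketch checks $\varepsilon$-typicality of a sample w.r.t. a finite class of test functions. The setup is an assumption made only for the example (it is not taken from the paper): $P$ is the uniform law on $[0,1]$, the test functions are indicators of the intervals $[0,t]$ for nine thresholds $t$, and the names `empirical_deviation` and `is_typical` are hypothetical.

```python
import random

def empirical_deviation(samples, test_functions, true_means):
    """Return max over f of |(1/n) sum_i f(z_i) - E_P f(Z)| for a finite class."""
    n = len(samples)
    return max(
        abs(sum(f(z) for z in samples) / n - mean)
        for f, mean in zip(test_functions, true_means)
    )

def is_typical(samples, test_functions, true_means, eps):
    """eps-typicality: the worst-case deviation over the class is below eps."""
    return empirical_deviation(samples, test_functions, true_means) < eps

# Assumed example: P = Uniform[0, 1], test functions are indicators of [0, t].
thresholds = [0.1 * k for k in range(1, 10)]
test_functions = [lambda z, t=t: 1.0 if z <= t else 0.0 for t in thresholds]
true_means = thresholds  # E 1{Z <= t} = t under Uniform[0, 1]

rng = random.Random(0)
samples = [rng.random() for _ in range(5000)]
deviation = empirical_deviation(samples, test_functions, true_means)
```

A long i.i.d. sample from $P$ passes the check for moderate $\varepsilon$, whereas a degenerate sample such as ten copies of the point 0 does not, since every indicator then has empirical mean 1.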
Turning now to the set-up of Figure 1, let us assume that
the observer's function class $\mathcal{F}$ satisfies the ULLN and that
$\bullet = B$. Then a simple achievability argument exploits the
fact (which we prove under mild regularity conditions) that,
for any probability law $Q = Q_{X_A X_B U}$ under which
$X_B \to X_A \to U$ is a Markov chain, there exists a rate-$R$ encoding
$U^n(X_A^n)$ from $\mathsf{X}_A^n$ into $\mathsf{U}^n$ such that the tuple $(X_B^n, U^n)$ is
$\varepsilon$-typical w.r.t. $\mathcal{F}$ for $Q$, provided $R > I(X_A; U | X_B)$. When
$\mathsf{X}_B = \varnothing$, so $\bullet = A$ (as in the empirical coordination scenario),
we simply apply the above argument to "degenerate" Markov
chains of the form $X_A \to X_A \to U$, where the rate condition
becomes $R > I(X_A; U)$.

We list the salient features of our approach: 

• When the underlying alphabet $\mathsf{Z}$ is finite, the ULLN is
satisfied by the class of all functions $f : \mathsf{Z} \to [-1, 1]$,
and our definition of typicality reduces to strong typicality
[2], [3].

• When $\mathsf{Z}$ is a complete separable metric space, the ULLN
is satisfied by the class of all Lipschitz functions
$f : \mathsf{Z} \to [-1, 1]$ with $\|f\|_\infty \le 1$ and Lipschitz constant bounded
by 1. Moreover, the ULLN in this case is equivalent to
almost sure weak convergence of empirical distributions
(Varadarajan's theorem [16, Theorem 11.4.1]).

• In general, there is a veritable plethora of function classes
satisfying the ULLN (we present several examples in
Section III-A). For instance, when $\mathsf{Z} = \mathbb{R}^d$, the ULLN
is satisfied by the indicator functions of all halfspaces,
balls, or rectangles (and of finite unions or intersections
thereof). One example, particularly relevant in source
coding, is the collection of indicator functions of Voronoi
cells induced by an arbitrary set of $m$ points in $\mathbb{R}^d$, for
any fixed $m$; indeed, any such cell is an intersection of
$O(m)$ halfspaces. Hence, our results apply to the setting
where $\mathsf{X}_\bullet \times \mathsf{U} \subset \mathbb{R}^d$ and each $(X_{\bullet,i}, U_i)$ is observed
through an m-point nearest-neighbor quantizer.
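
The claim that a Voronoi cell is an intersection of halfspaces can be checked numerically. The sketch below is only an illustration under stated assumptions: squared Euclidean distance, a small tolerance for ties, and hypothetical helper names `nearest_index` and `in_cell_via_halfspaces`. It uses the identity $\|z - c_j\|^2 \le \|z - c_k\|^2 \iff \langle 2(c_j - c_k), z \rangle + \|c_k\|^2 - \|c_j\|^2 \ge 0$, so the cell of $c_j$ is cut out by $m - 1$ halfspaces.

```python
import random

def nearest_index(z, centers):
    """Index of the center closest to z in squared Euclidean distance."""
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in centers]
    return dists.index(min(dists))

def in_cell_via_halfspaces(z, j, centers, tol=1e-12):
    """Test z against the m - 1 halfspaces <w, z> + b >= 0 that define cell j."""
    cj = centers[j]
    for k, ck in enumerate(centers):
        if k == j:
            continue
        w = [2.0 * (a - b) for a, b in zip(cj, ck)]           # normal vector
        b = sum(x * x for x in ck) - sum(x * x for x in cj)   # offset
        if sum(wi * zi for wi, zi in zip(w, z)) + b < -tol:
            return False
    return True

# Assumed example: three centers in the plane.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
z = (0.9, 0.1)
cell = nearest_index(z, centers)
```

For the point `z` above, the nearest-neighbor rule and the halfspace test select the same cell, which is the structural fact used in the text.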

B. Related work 

The focus of the present paper is exclusively on source 
coding. However, a recent preprint of Mitran [17] uses weak 
convergence to develop an extension of typical sequences to 
Polish alphabets and then applies that definition to several 
channel coding problems, including an achievability result for 
Gel'fand-Pinsker channels [18] with input cost constraints. 
What distinguishes Mitran's work from ours is his careful 
use of several equivalent characterizations of weak convergence
via the portmanteau theorem [16, Theorem 11.1.1].
In particular, his approach requires an explicit construction
of a countable generating set for the underlying Borel
$\sigma$-algebra that consists of the continuity sets of the probability
law of interest. As a consequence, he is able to establish a 
generalization of the Markov lemma [19], [20], which in turn 
allows him to use binning just like in the finite-alphabet case. 
By contrast, our notion of typicality is considerably broader 
(and, in fact, contains the one based on weak convergence as a 
special case), but, since we do not make any major structural 
assumptions beyond those needed for the ULLN, we cannot 
establish anything as strong as the Markov lemma. However, 
our proof technique does not rely on the Markov lemma in 
its strong form, and is more in the spirit of Wyner and Ziv 
[21]-[23]. 

We also note that a restricted notion of typicality based on 
weak convergence was used by Kontoyiannis and Zamir [24] 
in the context of universal vector quantization using entropy 
codes. The idea there is to consider sequences of increasing 
length, whose empirical distributions converge in the weak 
topology to the output distribution of an optimal test channel 
in a Shannon rate-distortion problem. 

C. Contents of the paper 

The remainder of the paper is organized as follows. Sec- 
tion II sets up the notation and lists the preliminaries. In 
Section III we formally define function classes that satisfy 
the ULLN and give several examples. Then, in Section IV 
we motivate and formally describe our approach to typicality 
and establish a number of key properties, including a lemma 
on the preservation of typicality in a Markov structure. Next, 
in Section V, using this lemma as the main technical tool, 
we illustrate the power of the proposed new approach by 
proving three theorems concerning fundamental limits on min- 
imal achievable rates for (i) two-node empirical coordination; 
(ii) rate-constrained distributed approximation of empirical 
processes with side information at the decoder; and (iii) 
lossy source coding under a family of distortion measures. 
Although these results apply to general (uncountably infinite) 
alphabets, the proofs are as intuitive and simple as in the 
finite-alphabet scenario. We follow up with some concluding 
remarks in Section VI. Lengthy proofs and discussions of 
auxiliary technical results are relegated to the Appendices. 

II. Preliminaries and notation 

All spaces in this paper are assumed to be standard Borel 
spaces (for detailed treatments, see the lecture notes of Preston 
[7] or Chapter 4 of Gray [8]): 

Definition 1. A measurable space $(\mathsf{Z}, \mathcal{B}_{\mathsf{Z}})$ is standard Borel
if it can be metrized with a metric $d$ such that (1) $(\mathsf{Z}, d)$
is a complete separable metric space, and (2) $\mathcal{B}_{\mathsf{Z}}$ coincides
with the Borel $\sigma$-algebra of $(\mathsf{Z}, d)$ (the smallest $\sigma$-algebra
containing all open sets).

Remark 1. A Polish space (i.e., a separable topological space 
whose topology can be metrized with a complete metric) is 



automatically standard Borel. In fact, the most general known 
class of standard Borel spaces consists of Borel subsets of 
Polish spaces [8, Theorem 4.3]. 

From now on, when dealing with a (standard Borel) space
$\mathsf{Z}$, we will often not mention its Borel $\sigma$-algebra explicitly. In
particular, we will tacitly assume that all probability measures
on $\mathsf{Z}$ are defined w.r.t. $\mathcal{B}_{\mathsf{Z}}$. The main objects associated with
$\mathsf{Z}$ that are of interest to us are as follows:

• $\mathcal{P}(\mathsf{Z})$ is the space of all probability measures on $\mathsf{Z}$

• $\mathcal{M}(\mathsf{Z})$ is the space of all measurable functions $f : \mathsf{Z} \to \mathbb{R}$

• $\mathcal{M}^b(\mathsf{Z}) \subset \mathcal{M}(\mathsf{Z})$ is the normed space of all bounded
measurable functions $f : \mathsf{Z} \to \mathbb{R}$ with the sup norm

$$ \|f\|_\infty = \sup_{z \in \mathsf{Z}} |f(z)| \tag{3} $$

• $\mathcal{M}^{b,1}(\mathsf{Z}) = \{f \in \mathcal{M}^b(\mathsf{Z}) : \|f\|_\infty \le 1\}$.

Other notation will be introduced as needed.

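Standard Borel spaces possess just enough useful structure
for our purposes. In particular, their $\sigma$-algebras are countably
generated and contain all singletons. They also admit the
existence of regular conditional distributions: If $\mathsf{Z} = \mathsf{X} \times \mathsf{Y}$
with the product $\sigma$-algebra, then the probability law $P \in \mathcal{P}(\mathsf{Z})$
of any random couple $(X, Y) \in \mathsf{Z}$ can be disintegrated as

$$ P(A \times B) = \int_A P_{Y|X}(B|x) \, P_X(dx), \quad \forall A \in \mathcal{B}_{\mathsf{X}}, \ B \in \mathcal{B}_{\mathsf{Y}} \tag{4} $$

where $P_X \in \mathcal{P}(\mathsf{X})$ is the marginal distribution of $X$ and
$P_{Y|X}(\cdot|\cdot) : \mathcal{B}_{\mathsf{Y}} \times \mathsf{X} \to [0, 1]$ is a Markov kernel, i.e.,
$P_{Y|X}(\cdot|x) \in \mathcal{P}(\mathsf{Y})$ for all $x \in \mathsf{X}$ and $P_{Y|X}(B|\cdot) \in \mathcal{M}(\mathsf{X})$
for all $B \in \mathcal{B}_{\mathsf{Y}}$. Given a random triple $(U, X, Y) \in \mathsf{U} \times \mathsf{X} \times \mathsf{Y}$
with joint law $P \in \mathcal{P}(\mathsf{U} \times \mathsf{X} \times \mathsf{Y})$, we will say that they form
a Markov chain in that order (and write $U \to X \to Y$) if

$$ P_{U|XY}(A|x, y) = P_{U|X}(A|x), \quad \forall A \in \mathcal{B}_{\mathsf{U}} \tag{5} $$

for $P$-almost all $x, y$.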

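We will often use de Finetti's linear functional notation for
expectations [25, Section 1.4]. That is, for any $P \in \mathcal{P}(\mathsf{Z})$ and
a $P$-integrable function $f : \mathsf{Z} \to \mathbb{R}$,

$$ P(f) \triangleq \mathbf{E}_P f(Z) = \int f \, dP, \tag{6} $$

and we will extend this notation in an obvious way to integrals
with respect to signed Borel measures on $\mathsf{Z}$. Given a class $\mathcal{F}$ of
measurable functions $f \in \mathcal{M}^{b,1}(\mathsf{Z})$, we can define a seminorm
on the space of all signed measures on $\mathsf{Z}$ via

$$ \|\nu\|_{\mathcal{F}} \triangleq \sup_{f \in \mathcal{F}} |\nu(f)|. \tag{7} $$

As an example, $\|P - P'\|_{\mathcal{M}^{b,1}(\mathsf{Z})}$ is precisely the total variation
distance

$$ \|P - P'\|_{\mathrm{TV}} \triangleq 2 \sup_{A \in \mathcal{B}_{\mathsf{Z}}} |P(A) - P'(A)| \tag{8} $$

between $P, P' \in \mathcal{P}(\mathsf{Z})$.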
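
On a finite alphabet the identity between the seminorm over all $[-1,1]$-valued functions and the total variation distance can be verified by brute force. The Python sketch below is illustrative only; the two probability mass functions and the function names are assumptions made for the example. It uses the fact that the supremum over $f$ is attained at $f = \mathrm{sign}(p - q)$, which gives $\sum_z |p(z) - q(z)|$.

```python
from itertools import chain, combinations

def seminorm_all_bounded(p, q):
    """sup over f: Z -> [-1, 1] of |P(f) - Q(f)|, attained at f = sign(p - q)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def total_variation(p, q):
    """2 * max over subsets A of |P(A) - Q(A)|, by enumerating all subsets."""
    idx = range(len(p))
    subsets = chain.from_iterable(combinations(idx, r) for r in range(len(p) + 1))
    return 2 * max(abs(sum(p[i] - q[i] for i in A)) for A in subsets)

# Assumed example: two pmfs on a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
lhs = seminorm_all_bounded(p, q)
rhs = total_variation(p, q)
```

Both quantities coincide (up to floating-point rounding) for these two mass functions, in agreement with the factor of 2 in the definition of the total variation distance used here.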

We will make use of several standard information-theoretic
definitions [5]. The divergence between $P$ and $P'$ in $\mathcal{P}(\mathsf{Z})$ is
defined as

$$ D(P \| P') \triangleq \begin{cases} P\left( \log \dfrac{dP}{dP'} \right), & \text{if } P \ll P' \\ +\infty, & \text{otherwise.} \end{cases} \tag{9} $$

Given a $Q \in \mathcal{P}(\mathsf{X} \times \mathsf{Y})$, the mutual information between
$X \in \mathsf{X}$ and $Y \in \mathsf{Y}$ with joint law $Q$ is

$$ I(Q) = D(Q \| Q_X \otimes Q_Y), \tag{10} $$

where $Q_X \otimes Q_Y$ is the product of the marginals. Whenever
$Q$ is clear from context, we will also write $I(X; Y)$ instead
of $I(Q)$. We will use standard notation for such things as the
conditional mutual information.
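
For finite alphabets, the two definitions above reduce to finite sums, as in the following Python sketch. It is only an illustration: the nested-list representation of the joint law, the use of natural logarithms, and the function names are assumptions made here.

```python
import math

def divergence(p, q):
    """D(P||Q) in nats for pmfs on a common finite set; +inf unless P << Q."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf  # P is not absolutely continuous w.r.t. Q
            total += a * math.log(a / b)
    return total

def mutual_information(joint):
    """I(Q) = D(Q || Q_X (x) Q_Y) for a joint pmf given as a nested list."""
    qx = [sum(row) for row in joint]          # marginal of X (rows)
    qy = [sum(col) for col in zip(*joint)]    # marginal of Y (columns)
    flat_joint = [v for row in joint for v in row]
    flat_prod = [a * b for a in qx for b in qy]
    return divergence(flat_joint, flat_prod)
```

For instance, a pair of independent fair bits has mutual information zero, while two identical fair bits have mutual information $\log 2$ nats.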

III. Uniform Laws of Large Numbers and 
Glivenko-Cantelli classes 

Given an n-tuple $z^n = (z_1, \ldots, z_n) \in \mathsf{Z}^n$, let us denote by
$P_{z^n}$ the induced empirical measure:

$$ P_{z^n} \triangleq \frac{1}{n} \sum_{i=1}^n \delta_{z_i}, \tag{11} $$

where $\delta_{z_i}$ is the Dirac measure concentrated at $z_i$ (since $\mathcal{B}_{\mathsf{Z}}$
contains all singletons, $\delta_z \in \mathcal{P}(\mathsf{Z})$ for every $z \in \mathsf{Z}$). If $\{Z_i\}_{i=1}^\infty$
is an i.i.d. sequence with common distribution $P \in \mathcal{P}(\mathsf{Z})$,
then the Strong Law of Large Numbers says that, for any
$f \in \mathcal{M}^{b,1}(\mathsf{Z})$, the empirical means

$$ P_{Z^n}(f) = \frac{1}{n} \sum_{i=1}^n f(Z_i), \quad n \in \mathbb{N} \tag{12} $$

converge to the true mean $P(f)$ almost surely. By the union
bound, this holds for any finite family of functions. In this
paper, we consider infinite function classes that admit a
Uniform Law of Large Numbers, that is, absolute deviations
between empirical and true means converge to zero uniformly
over the function class. The canonical example of such a
class appears in the celebrated Glivenko-Cantelli theorem [16,
Theorem 11.4.2]: Let $Z$ be a real-valued random variable with
CDF $F_Z$, and let $\{Z_i\}_{i=1}^\infty$ be an infinite sequence of i.i.d.
copies of $Z$. For each $n$, consider the empirical CDF

$$ F_{Z^n}(z) \triangleq \frac{1}{n} \sum_{i=1}^n 1_{\{Z_i \le z\}}. \tag{13} $$

The Glivenko-Cantelli theorem then says that

$$ \sup_{z \in \mathbb{R}} |F_{Z^n}(z) - F_Z(z)| \xrightarrow{n \to \infty} 0 \quad \text{a.s.} \tag{14} $$
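
The uniform convergence in (14) is easy to observe numerically. The sketch below is a small simulation under assumptions made only for this example: the source is uniform on $[0,1]$ (so that $F_Z(z) = z$), the seed and sample sizes are arbitrary, and `sup_cdf_deviation` is a hypothetical helper name.

```python
import random

def sup_cdf_deviation(samples):
    """sup_z |F_n(z) - z| for samples compared against the Uniform[0, 1] CDF."""
    xs = sorted(samples)
    n = len(xs)
    # The supremum is attained at a sample point, either at the jump of the
    # empirical CDF (value (i + 1) / n) or just before it (value i / n).
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

rng = random.Random(1)
deviations = {
    n: sup_cdf_deviation([rng.random() for _ in range(n)])
    for n in (100, 10_000)
}
```

The entries of `deviations` give the sup-norm deviation at the two sample sizes; for large $n$ the value is small, which is the behavior the theorem describes.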

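To cast it as a statement about a function class, consider

$$ \mathcal{F} = \{ f_z = 1_{(-\infty, z]} : z \in \mathbb{R} \}. \tag{15} $$

Then for any $z \in \mathbb{R}$,

$$ F_{Z^n}(z) = P_{Z^n}(f_z) \tag{16} $$
$$ F_Z(z) = P_Z(f_z) \tag{17} $$

and consequently

$$ \sup_{z \in \mathbb{R}} |F_{Z^n}(z) - F_Z(z)| = \|P_{Z^n} - P_Z\|_{\mathcal{F}} \xrightarrow{n \to \infty} 0 \quad \text{a.s.} \tag{18} $$

This motivates the following definition [9]-[11]: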



Definition 2. A class $\mathcal{F}$ of measurable functions $f \in \mathcal{M}^{b,1}(\mathsf{Z})$
is called Glivenko-Cantelli² (or GC, for short) if

$$ \|P_{Z^n} - P\|_{\mathcal{F}} \xrightarrow{n \to \infty} 0 \quad \text{a.s.} \tag{19} $$

for every $P \in \mathcal{P}(\mathsf{Z})$, where $\{Z_i\}_{i=1}^\infty$ is an i.i.d. random process
with marginal distribution $P$.

Remark 2. In view of this definition, the classical Glivenko-Cantelli
theorem can be paraphrased as follows: The class of
all indicator functions of semi-infinite intervals of the form
$(-\infty, z]$, $z \in \mathbb{R}$, is GC.

Remark 3. The restriction to bounded functions is mostly 
needed for technical convenience and can be removed by 
means of suitable moment conditions and straightforward, 
though tedious, truncation arguments. A nice side benefit 
of the boundedness assumption, though, is that no loss of 
generality occurs if the almost sure convergence in (19) is 
replaced with convergence in probability [10], [26]. 

Remark 4. It should be borne in mind that when the function
class $\mathcal{F}$ is uncountable, $\|P_{Z^n} - P\|_{\mathcal{F}}$ may not be a random
variable (there is always a risk of spawning a nonmeasurable
monster whenever one dabbles in uncountable operations).
There are a number of ways to deal with such issues, as
detailed in [9, Appendix] or [10, Section 2.3]. For our purposes,
it will suffice to assume that $\mathcal{F}$ is countable or "nice"
in the sense that it contains a countable subset $\mathcal{G}$ such that for
every $f \in \mathcal{F}$ there is a sequence $\{g_m\}$ in $\mathcal{G}$ converging to $f$
pointwise. Then

$$ \|P_{Z^n} - P\|_{\mathcal{F}} = \|P_{Z^n} - P\|_{\mathcal{G}} \tag{20} $$

and the r.h.s. is a measurable function of $Z^n$ [10, p. 110].

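Let $(\Omega, \mathcal{B}, \mathbb{P})$ be an underlying probability space for the
random process $\{Z_i\}$. Then for each $n$ we can construct
another random process on $(\Omega, \mathcal{B}, \mathbb{P})$, indexed by $\mathcal{F}$:

$$ \Delta^{(n)}_f(\omega) \triangleq P_{Z^n(\omega)}(f) - P(f), \quad f \in \mathcal{F}. \tag{21} $$

This is an instance of an empirical process [9]-[11], which
is used to describe the fluctuations of the empirical means
$P_{Z^n}(f)$ around the expectation $P(f)$. A GC class is one for
which the $\ell^\infty(\mathcal{F})$ norms

$$ \|\Delta^{(n)}(\omega)\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |\Delta^{(n)}_f(\omega)| \tag{22} $$

of the empirical processes $\{\Delta^{(n)}_f\}_{f \in \mathcal{F}}$, $n \ge 1$, converge to
zero almost surely.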

A. Examples of Glivenko-Cantelli classes 

We close this section by listing several examples of GC
classes. Usually, whether or not a given class $\mathcal{F}$ is GC depends
on how "large" it is. The simplest notion of size is captured
by the (metric) entropy numbers of $\mathcal{F}$ [27]. Given any $\varepsilon > 0$,
the covering number $N(\varepsilon, \mathcal{F}, Q)$ of $\mathcal{F} \subset \mathcal{M}^{b,1}(\mathsf{Z})$ w.r.t. a
probability measure $Q \in \mathcal{P}(\mathsf{Z})$ is the minimal number of balls
$\{g : \|g - f\|_{L^1(Q)} < \varepsilon\}$, $f \in \mathcal{M}^{b,1}(\mathsf{Z})$, of radius $\varepsilon$ needed to
cover $\mathcal{F}$. The entropy number of $\mathcal{F}$ is $\log N(\varepsilon, \mathcal{F}, Q)$. Then
(under additional measurability assumptions, cf. Remark 4) $\mathcal{F}$
is GC if

$$ \sup_{Q \in \mathcal{P}(\mathsf{Z})} N(\varepsilon, \mathcal{F}, Q) < \infty, \quad \forall \varepsilon > 0. \tag{23} $$

²Strictly speaking, the proper term is "universal Glivenko-Cantelli," but we
will follow standard usage and just say "Glivenko-Cantelli."

Other conditions for a class to be GC involve alternative no- 
tions of entropy, such as entropy with bracketing. Chapter 2 of 
van der Vaart and Wellner [10] contains a detailed exposition 
of these matters. Examples 1-4 below follow [10]; Example 5 
shows that the well-known theorem of Varadarajan on almost 
sure weak convergence of empirical measures can be stated in 
the form of a ULLN for an appropriate GC class. 

Example 1 (Vapnik-Chervonenkis classes). Given any collection
$\mathcal{A} \subset \mathcal{B}_{\mathsf{Z}}$ and any finite set $C \subset \mathsf{Z}$, define

$$ S(\mathcal{A}, C) = |\{C \cap A : A \in \mathcal{A}\}| \tag{24} $$
$$ S_n(\mathcal{A}) = \max_{|C| \le n} S(\mathcal{A}, C) \tag{25} $$

and let $V(\mathcal{A}) = \max\{n \in \mathbb{N} : S_n(\mathcal{A}) = 2^n\}$. After the
fundamental work of Vapnik and Chervonenkis [28] where these
combinatorial parameters were first introduced, any class $\mathcal{A}$
such that $V(\mathcal{A}) < \infty$ is called a Vapnik-Chervonenkis (VC)
class, and $V(\mathcal{A})$ is called its Vapnik-Chervonenkis (VC)
dimension. Examples of VC classes include:

• The class of all rectangles in $\mathbb{R}^d$ with VC dimension $2d$.

• The class of all linear halfspaces $H_{w,b} = \{z \in \mathbb{R}^d :
\langle w, z \rangle + b \ge 0\}$ for $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, with VC dimension
$d + 1$.

• The class of all closed balls $B_{x,r} = \{z \in \mathbb{R}^d : \|z - x\| \le
r\}$ for $x \in \mathbb{R}^d$, $r \in \mathbb{R}_+$, with VC dimension

Given a collection $\mathcal{A} \subset \mathcal{B}_{\mathsf{Z}}$, let $\mathcal{F}_{\mathcal{A}}$ consist of the
indicator functions of the elements of $\mathcal{A}$: $\mathcal{F}_{\mathcal{A}} = \{1_A : A \in \mathcal{A}\}$.
Then $\mathcal{F}_{\mathcal{A}}$ is GC, provided $\mathcal{A}$ is a VC class.

Finite set-theoretic operations (unions, intersections,
complements) on VC classes yield VC classes as well. In particular,
consider the collection of all Voronoi cells induced by all
m-point subsets of $\mathbb{R}^d$. Each member of this collection is an
intersection of $O(m)$ halfspaces, and therefore we have a VC
class. Likewise, injective images of VC classes are VC.
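
The combinatorial quantity $S(\mathcal{A}, C)$ in (24) can be computed directly for simple classes on the real line. The following Python sketch is illustrative only: the two classes (half-lines $(-\infty, z]$ and closed intervals $[a, b]$), the particular point sets, and all function names are assumptions chosen for the example. For points on the line it suffices to consider endpoints located at the points themselves, plus one set that misses every point.

```python
def trace_count(points, predicates):
    """S(A, C): number of distinct subsets of C cut out by the given sets."""
    return len({frozenset(p for p in points if member(p))
                for member in predicates})

def half_lines(points):
    """Enough half-lines (-inf, z] to realize every possible trace on points."""
    cuts = [min(points) - 1.0] + sorted(points)
    return [lambda p, z=z: p <= z for z in cuts]

def intervals(points):
    """Enough closed intervals [a, b] to realize every possible trace."""
    pts = sorted(points)
    misses_all = [lambda p: False]  # an interval containing none of the points
    return misses_all + [lambda p, a=a, b=b: a <= p <= b
                         for i, a in enumerate(pts) for b in pts[i:]]

three = [0.1, 0.4, 0.7]
two = [0.1, 0.4]
```

On three points, half-lines produce only $4 < 2^3$ traces and intervals produce $7 < 2^3$, while intervals shatter any two points ($4 = 2^2$ traces) and half-lines shatter only a single point. This is consistent with VC dimensions 1 and 2 for these two classes.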

Example 2 (VC-subgraph classes). Given a function $f \in
\mathcal{M}(\mathsf{Z})$, its subgraph is the subset of $\mathsf{Z} \times \mathbb{R}$ given by
$\{(z, t) : f(z) > t\}$. A class of functions $\mathcal{F} \subset \mathcal{M}(\mathsf{Z})$ is called a
VC-subgraph class if the collection of all subgraphs of all $f \in \mathcal{F}$
is a VC class in $\mathsf{Z} \times \mathbb{R}$. We define $V(\mathcal{F})$, the VC dimension
of $\mathcal{F}$, as the VC dimension of the corresponding collection of
subgraphs. For example, if $\mathcal{F}$ is a linear span of $m$ functions
$f_1, \ldots, f_m \in \mathcal{M}(\mathsf{Z})$, then it is a VC-subgraph class with
$V(\mathcal{F}) \le m + 2$. In this paper, we are interested primarily in the
case when $\mathcal{F} \subset \mathcal{M}^{b,1}(\mathsf{Z})$. Hence, if $f_1, \ldots, f_m \in \mathcal{M}^{b,1}(\mathsf{Z})$,
then their convex hull is a VC-subgraph class.

Example 3 (VC-hull classes). A class of functions $\mathcal{F} \subset
\mathcal{M}(\mathsf{Z})$ is a VC-hull class if there exists a VC-subgraph class
$\mathcal{G} \subset \mathcal{M}(\mathsf{Z})$, such that every $f \in \mathcal{F}$ is a pointwise limit of a
sequence of functions $\{f_n\}$ contained in the symmetric convex
hull of $\mathcal{G}$,

$$ \left\{ \sum_{i=1}^m c_i g_i : m \in \mathbb{N}; \ \sum_{i=1}^m |c_i| \le 1; \ g_1, \ldots, g_m \in \mathcal{G} \right\}. \tag{26} $$

For example, the set of all monotone functions $f : \mathbb{R} \to [0, 1]$
is VC-hull (though not VC-subgraph).

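Example 4 (Smooth functions). Let $\mathsf{Z} = [0, 1]^d$. For any
multi-index, i.e., a vector $k = (k_1, \ldots, k_d) \in \{0, 1, \ldots\}^d$,
define the differential operator

$$ D^k \triangleq \frac{\partial^{|k|}}{\partial z_1^{k_1} \cdots \partial z_d^{k_d}}, \tag{27} $$

where $|k| = k_1 + \cdots + k_d$. Given $\alpha > 0$, define for a function
$f : [0, 1]^d \to \mathbb{R}$

$$ \|f\|_\alpha \triangleq \max_{|k| \le \underline{\alpha}} \sup_{z} |D^k f(z)| + \max_{|k| = \underline{\alpha}} \sup_{z \ne z'} \frac{|D^k f(z) - D^k f(z')|}{\|z - z'\|^{\alpha - \underline{\alpha}}} \tag{28} $$

(with $\underline{\alpha}$ the greatest integer strictly smaller than $\alpha$, as in [10]).
Let $C^\alpha$ be the set of all continuous functions $f : [0, 1]^d \to \mathbb{R}$
with $\|f\|_\alpha \le 1$. Then $C^\alpha$ is a GC class.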

Example 5 (Bounded Lipschitz functions). Let $(\mathsf{Z}, d)$ be a
complete separable metric space. Define the Lipschitz seminorm
$\|\cdot\|_L$ on $\mathcal{M}(\mathsf{Z})$ by

$$ \|f\|_L \triangleq \sup_{z \ne z'} \frac{|f(z) - f(z')|}{d(z, z')} \tag{29} $$

and the bounded Lipschitz norm $\|\cdot\|_{BL}$ by

$$ \|f\|_{BL} \triangleq \|f\|_\infty + \|f\|_L. \tag{30} $$

Note that any function $f$ with $\|f\|_{BL} < \infty$ is automatically in
$C_b(\mathsf{Z})$, the Banach space of all bounded continuous functions
on $\mathsf{Z}$.

Let $\mathcal{F}_{BL} \triangleq \{f \in C_b(\mathsf{Z}) : \|f\|_{BL} \le 1\}$. Then $\mathcal{F}_{BL}$ is a
GC class. This is a consequence of the fact that the bounded
Lipschitz metric (also known as the Fortet-Mourier metric)

$$ \beta(P, P') \triangleq \sup_{f \in \mathcal{F}_{BL}} |P(f) - P'(f)| \tag{31} $$
$$ = \|P - P'\|_{\mathcal{F}_{BL}}, \quad P, P' \in \mathcal{P}(\mathsf{Z}) \tag{32} $$

metrizes the topology of weak convergence in $\mathcal{P}(\mathsf{Z})$. Recall
that a sequence $\{P_n\}$ in $\mathcal{P}(\mathsf{Z})$ converges weakly to $P \in \mathcal{P}(\mathsf{Z})$
(the fact denoted by $P_n \Rightarrow P$) if

$$ P_n(f) \xrightarrow{n \to \infty} P(f), \quad \forall f \in C_b(\mathsf{Z}). \tag{33} $$

Then $P_n \Rightarrow P$ if and only if $\beta(P_n, P) \to 0$ [16,
Theorem 11.3.3]. Now, according to a theorem of Varadarajan
[16, Theorem 11.4.1], given any i.i.d. random process $\{Z_i\}_{i=1}^\infty$
over $\mathsf{Z}$ with common marginal distribution $P \in \mathcal{P}(\mathsf{Z})$, the
empirical distributions $P_{Z^n}$ converge weakly to $P$ almost
surely:

$$ P_{Z^n} \Rightarrow P \quad \text{a.s.} \tag{34} $$

From the foregoing discussion, (34) is equivalent to

$$ \beta(P_{Z^n}, P) = \sup_{f \in \mathcal{F}_{BL}} |P_{Z^n}(f) - P(f)| \tag{35} $$
$$ = \|P_{Z^n} - P\|_{\mathcal{F}_{BL}} \xrightarrow{n \to \infty} 0 \quad \text{a.s.} \tag{36} $$

In other words, $\mathcal{F}_{BL}$ is a GC class, and Varadarajan's theorem
can be paraphrased to say that this function class obeys a
ULLN.

IV. Rethinking typicality for general alphabets 

Now that all necessary definitions are made, we can intro- 
duce our revised notion of typicality for standard Borel spaces. 

For finite alphabets, there are multiple equivalent defini- 
tions of a typical sequence. Here is one, based on the total 
variation distance [3], often referred to as strong typicality [2, 
Section 10.6]: 

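Definition 3. Given a finite set $\mathsf{Z}$ and a probability distribution
(mass function) $P$ on it, the typical set $T^{(n)}_\varepsilon(P)$, for $\varepsilon > 0$, is
the set of all n-tuples $z^n \in \mathsf{Z}^n$ whose empirical distributions
$P_{z^n}$ are $\varepsilon$-close to $P$ in total variation:

$$ T^{(n)}_\varepsilon(P) \triangleq \{z^n \in \mathsf{Z}^n : \|P_{z^n} - P\|_{\mathrm{TV}} < \varepsilon\}. \tag{37} $$

By the Law of Large Numbers, if $\{Z_i\}$ is a sequence of i.i.d.
draws from $P$, then

$$ \mathbb{P}\left(Z^n \notin T^{(n)}_\varepsilon(P)\right) \xrightarrow{n \to \infty} 0. \tag{38} $$

If $\mathsf{Z}$ is a Cartesian product $\mathsf{X} \times \mathsf{Y}$, then one can define jointly
and conditionally typical sets and sequences [2].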

However, all of this breaks down for general (uncountably
infinite) alphabets. The reason is that the total variation
distance between any discrete measure and a nonatomic measure
is equal to 2. Indeed, if $(\mathsf{Z}, \mathcal{B}_{\mathsf{Z}})$ is a standard Borel space
and $P \in \mathcal{P}(\mathsf{Z})$ assigns zero mass to singletons, $P(\{z\}) = 0$,
$\forall z \in \mathsf{Z}$, then we can take any n-tuple $z^n \in \mathsf{Z}^n$ and let $A$
be the set of its distinct elements, so that $P_{z^n}(A) = 1$ and
$P(A) = 0$. Using this and the definition (8), we deduce that
$\|P_{z^n} - P\|_{\mathrm{TV}} = 2$.
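
For completeness, the last step can be spelled out as follows, using only definition (8), the fact that the finite set $A$ is measurable with $P(A) = 0$ by finite additivity, and the fact that a difference of two probabilities never exceeds 1 in absolute value:

```latex
\begin{align*}
\|P_{z^n} - P\|_{\mathrm{TV}}
  &= 2 \sup_{B \in \mathcal{B}_{\mathsf{Z}}} |P_{z^n}(B) - P(B)| \\
  &\ge 2\,|P_{z^n}(A) - P(A)| = 2\,|1 - 0| = 2, \\
\|P_{z^n} - P\|_{\mathrm{TV}}
  &\le 2 \quad \text{since } |P_{z^n}(B) - P(B)| \le 1 \text{ for every } B \in \mathcal{B}_{\mathsf{Z}}.
\end{align*}
```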

Of course, one could use typicality arguments by consider- 
ing arbitrary finite quantizations of the underlying spaces, but, 
as long as we are dealing with nonatomic measures, this does 
not get rid of the above issue even in the limit of increasingly 
fine quantizations. While discretization is sufficient for many 
purposes [5], there is another issue that arises when dealing 
with Markov structures in multiterminal settings: quantization 
destroys the Markov property [29, Section VIII]. 

To resolve this conundrum, we recall (cf. Sec. II) that

$$ \|P - P'\|_{\mathrm{TV}} = \sup_{\|f\|_\infty \le 1} |P(f) - P'(f)|, \tag{39} $$

where the supremum is over all measurable functions $f :
\mathsf{Z} \to [-1, 1]$. When the underlying measurable space supports
nonatomic probability measures, this function class turns out
to be too large to admit uniform convergence of empirical
averages to statistical expectations. A natural solution, then, is
to restrict the class of functions:

Definition 4. Let $\mathsf{Z}$ be a Borel space and let $\mathcal{F} \subset \mathcal{M}^{b,1}(\mathsf{Z})$
be a GC class of functions. Given a probability measure $P \in
\mathcal{P}(\mathsf{Z})$, the typical set $T^{(n)}_{\varepsilon,\mathcal{F}}(P)$, for $\varepsilon > 0$, is the set of all
n-tuples $z^n \in \mathsf{Z}^n$ whose empirical distributions $P_{z^n}$ are $\varepsilon$-close
to $P$ in the $\|\cdot\|_{\mathcal{F}}$ seminorm:

$$ T^{(n)}_{\varepsilon,\mathcal{F}}(P) \triangleq \{z^n \in \mathsf{Z}^n : \|P_{z^n} - P\|_{\mathcal{F}} < \varepsilon\}. \tag{40} $$

One thing to note is that when $\mathsf{Z}$ is finite, we can just
take $\mathcal{F} = \mathcal{M}^{b,1}(\mathsf{Z})$ and immediately recover Definition 3.
Moreover, if $\mathsf{Z}$ is a complete separable metric space, then we
can take $\mathcal{F} = \mathcal{F}_{BL}$, in which case our notion of typicality
becomes compatible with the bounded Lipschitz metric that
metrizes the weak topology on the space of probability laws
(cf. Example 5).



A. Basic properties of GC typical sets 

We now establish several basic properties of GC typical 
sets. First of all, any sufficiently long sequence emitted by a 
stationary memoryless source is typical with high probability: 

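Proposition 1. Consider a Borel space $\mathsf{Z}$ and a GC class
$\mathcal{F} \subset \mathcal{M}^{b,1}(\mathsf{Z})$. If $\{Z_i\}_{i=1}^\infty$ is an i.i.d. random process over $\mathsf{Z}$
with common law $P$, then for any $\varepsilon > 0$

$$ \lim_{n \to \infty} \mathbb{P}\left(Z^n \notin T^{(n)}_{\varepsilon,\mathcal{F}}(P)\right) = 0. \tag{41} $$

Proof: Immediate from definitions. ■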
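Another desirable property is for typicality to be preserved
under coordinate projections. It is not hard to show that, for
any two finite alphabets $\mathsf{X}$ and $\mathsf{Y}$ and any two n-tuples $x^n \in
\mathsf{X}^n$ and $y^n \in \mathsf{Y}^n$ that are jointly typical w.r.t. some $P \in
\mathcal{P}(\mathsf{X} \times \mathsf{Y})$ in the sense of Definition 3, $x^n$ (resp., $y^n$) is typical
w.r.t. the marginal distribution $P_X$ (resp., $P_Y$). The following
proposition gives a sufficient condition for GC typicality to be
preserved under projections:

Proposition 2. Suppose $\mathsf{Z} = \mathsf{X} \times \mathsf{Y}$. Let $\pi_{\mathsf{X}} : \mathsf{Z} \to \mathsf{X}$ be the
coordinate projection mapping onto $\mathsf{X}$, i.e., $\pi_{\mathsf{X}}(x, y) = x$, and
extend it to tuples via

$$ \pi_{\mathsf{X}}((x_1, y_1), \ldots, (x_n, y_n)) = (x_1, \ldots, x_n). \tag{42} $$

Then for any $n \in \mathbb{N}$, any $\varepsilon > 0$, any $P \in \mathcal{P}(\mathsf{Z})$, and any GC
class $\mathcal{F}_{\mathsf{X}} \subset \mathcal{M}^{b,1}(\mathsf{X})$ such that $\mathcal{F}_{\mathsf{X}} \circ \pi_{\mathsf{X}} \subseteq \mathcal{F}$, we have the
inclusion

$$ \pi_{\mathsf{X}}\left(T^{(n)}_{\varepsilon,\mathcal{F}}(P)\right) \subseteq T^{(n)}_{\varepsilon,\mathcal{F}_{\mathsf{X}}}(P_X). \tag{43} $$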

Remark 5. As can be seen from the proof below, the class
$\mathcal{F}_{\mathsf{X}}$ need not be GC in order for the inclusion (43) to hold.
However, then one would not be able to transfer a convergence 
result like Proposition 1 to the X-valued part of the sequence. 

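Proof: Suppose $z^n = ((x_1, y_1), \ldots, (x_n, y_n)) \in T^{(n)}_{\varepsilon,\mathcal{F}}(P)$. Then

$$ \|P_{x^n} - P_X\|_{\mathcal{F}_{\mathsf{X}}} = \sup_{f \in \mathcal{F}_{\mathsf{X}}} \left| \frac{1}{n} \sum_{i=1}^n f(x_i) - P_X(f) \right| \tag{44} $$
$$ = \sup_{f \in \mathcal{F}_{\mathsf{X}}} \left| \frac{1}{n} \sum_{i=1}^n f \circ \pi_{\mathsf{X}}(z_i) - P(f \circ \pi_{\mathsf{X}}) \right| \tag{45} $$
$$ \le \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(z_i) - P(f) \right| \tag{46} $$
$$ = \|P_{z^n} - P\|_{\mathcal{F}} \tag{47} $$
$$ < \varepsilon. \tag{48} $$

Thus, $x^n \in T^{(n)}_{\varepsilon,\mathcal{F}_{\mathsf{X}}}(P_X)$, which proves (43). ■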
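As an example, let $\mathsf{X} = \mathbb{R}^k$, $\mathsf{Y} = \mathbb{R}^m$, let $\mathcal{F}$ be the collection
of indicator functions of all halfspaces in $\mathsf{Z} = \mathbb{R}^{k+m}$, and let
$\mathcal{F}_{\mathsf{X}}$ be the collection of indicator functions of all halfspaces
in $\mathsf{X}$ (cf. Example 1 for definitions and notation). For any
$w \in \mathbb{R}^k$, $b \in \mathbb{R}$ and $z = (x, y) \in \mathsf{Z}$, we have

$$ \langle w, x \rangle + b = \langle w, \pi_{\mathsf{X}}(z) \rangle + b \tag{49} $$
$$ = \langle (w, 0), (x, y) \rangle + b. \tag{50} $$

Hence, $1_{H_{(w,0),b}} = 1_{H_{w,b}} \circ \pi_{\mathsf{X}}$ for any choice of $w \in
\mathbb{R}^k$, $b \in \mathbb{R}$, so the condition of Proposition 2 is satisfied.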
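
The identity behind this example can be checked numerically. The Python sketch below is illustrative only; it assumes the closed-halfspace convention $\langle w, z \rangle + b \ge 0$, and the function names are hypothetical. It compares the indicator of a halfspace in $\mathbb{R}^k$, evaluated at $x$, with the indicator of the lifted halfspace in $\mathbb{R}^{k+m}$ with normal vector $(w, 0)$, evaluated at $(x, y)$.

```python
import random

def halfspace_indicator(w, b, z):
    """Indicator of the halfspace {z : <w, z> + b >= 0}."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b >= 0 else 0

def lifted_indicator(w, b, x, y):
    """Indicator of the halfspace in X x Y with normal (w, 0) and offset b."""
    w_lifted = list(w) + [0.0] * len(y)   # pad the normal vector with zeros
    return halfspace_indicator(w_lifted, b, list(x) + list(y))

# Assumed example with k = 2 and m = 3.
rng = random.Random(7)
w = [rng.gauss(0, 1) for _ in range(2)]
b = rng.gauss(0, 1)
pairs = [([rng.gauss(0, 1) for _ in range(2)],
          [rng.gauss(0, 1) for _ in range(3)]) for _ in range(500)]
agree = all(halfspace_indicator(w, b, x) == lifted_indicator(w, b, x, y)
            for x, y in pairs)
```

Since the $y$-coordinates are multiplied by zero, the two indicators agree at every sampled point, whatever the value of $y$.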

Finally, we show that our definition of typicality can work
in a multiterminal setting. Ideally, one would like to have
something like the Markov lemma [19], [20]: If $X \to Y \to Z$
is a Markov chain, $(x^n, y^n)$ is typical, and $Z^n$ is obtained by
passing $y^n$ through a memoryless channel, then $(x^n, y^n, Z^n)$
should be typical with high probability. However, in our setting
such a statement does not make much sense without assuming
additional structure for the function class $\mathcal{F}$.³ Instead, we
establish the following result, which is essentially an abstract
alphabet version of the so-called Piggyback Coding Lemma
of Wyner [21, Lemma 4.3]:

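Lemma 1. Let $U \in \mathsf{U}$, $V \in \mathsf{V}$, and $W \in \mathsf{W}$ be random variables taking values in their respective standard Borel spaces according to a joint distribution $P_{UVW}$, such that $U \to V \to W$ is a Markov chain and $I(V;W) < \infty$. Let $\{(U_i, V_i, W_i)\}_{i=1}^\infty$ be a sequence of i.i.d. draws from $P_{UVW}$. Let $\mathcal{F} \subset M^{b,1}(\mathsf{U} \times \mathsf{W})$ be a GC class of functions. For a given $\varepsilon > 0$, there exist an $n = n(\varepsilon)$ and a mapping $\Phi_n : \mathsf{V}^n \to \mathsf{W}^n$, such that

$$\frac{1}{n} \log \big|\{\Phi_n(v^n) : v^n \in \mathsf{V}^n\}\big| \le I(V;W) + \varepsilon \tag{51}$$

and

$$\mathbb{E}\big\|\mathsf{P}_{(U^n, \Phi_n(V^n))} - P_{UW}\big\|_{\mathcal{F}} < \varepsilon. \tag{52}$$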

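Proof: For each $n$, define the function $\psi_n \in M^{b,1}(\mathsf{U}^n \times \mathsf{W}^n)$ by

$$\psi_n(u^n, w^n) = \big\|\mathsf{P}_{(u^n, w^n)} - P_{UW}\big\|_{\mathcal{F}}. \tag{53}$$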

$^3$ Incidentally, this is exactly what Mitran [17] accomplishes for his notion of typicality based on weak convergence.



[Figure: Node A observes $X^n$ and sends the message $M = e_n(X^n)$ to Node B, which produces $Y^n = d_n(M)$.]

Fig. 2. Two-node empirical coordination.



Since $\mathcal{F}$ is a GC class, we have by Proposition 1

$$\lim_{n \to \infty} \mathbb{E}\,\psi_n(U^n, W^n) = 0. \tag{54}$$

The desired statement now follows from Lemma A.1 in Appendix A. ■

V. Applications to empirical coordination 

We now show three sample applications of GC typicality to 
the problem of empirical coordination in a two-node network 
shown in Figure 1. This problem, recently formulated and 
studied by Cuff et al. [3], concerns joint generation of actions 
at the two nodes, such that the empirical distribution of the 
actions over time approximates, asymptotically, a desired joint 
distribution in total variation. Our goal is to extend this setting 
to general alphabets. As we have shown in Section IV, the total 
variation criterion is unsuitable for uncountable alphabets, so 
we consider a relaxation to an appropriate GC class. 

As we will show, our notion of GC typicality and Lemma 1 can be used to develop particularly intuitive achievability arguments and to obtain single-letter characterizations of the best achievable rates. Moreover, convexity of the $\|\cdot\|_{\mathcal{F}}$ seminorm is helpful for proving converse results. The downside, however, is that, in general, it is not possible to compute the best achievable rates explicitly even for "simple" sources due to the presence of the supremum over $\mathcal{F}$.

A. Two-node empirical coordination 

Consider the two-node network shown in Fig. 2, where Node A (resp., Node B) generates actions from a Borel space $\mathsf{X}$ (resp., $\mathsf{Y}$). At Node A, the actions are drawn i.i.d. from a fixed law $P_X \in \mathcal{P}(\mathsf{X})$. We also have a conditional probability measure $P_{Y|X}$ that describes the desired distribution of actions at Node B given the actions at Node A. Following the terminology of [3], we will also refer to the choice of $P_{Y|X}$ as a coordination. Node A can communicate with Node B over a rate-limited channel, and Node B uses the data it receives to choose its actions. For each $n$, let $X^n \in \mathsf{X}^n$ and $Y^n \in \mathsf{Y}^n$ denote the action sequences at the two nodes. Given a class $\mathcal{F} \subset M^{b,1}(\mathsf{X} \times \mathsf{Y})$ of measurable "test functions" and a desired distortion level $\Delta > 0$, the goal is for Node A to communicate with Node B at a minimal rate to guarantee that, asymptotically,

$$\mathbb{E}\big\|\mathsf{P}_{(X^n, Y^n)} - P_X \otimes P_{Y|X}\big\|_{\mathcal{F}} \le \Delta, \tag{55}$$

where $\mathsf{P}_{(X^n,Y^n)}$ is the empirical distribution of the action pairs and $P_{XY} = P_X \otimes P_{Y|X}$ is the joint law induced by the source $P_X$ and the coordination $P_{Y|X}$.

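Definition 5. An $(n, M)$-code is a pair $(e_n, d_n)$, where $e_n : \mathsf{X}^n \to [M]$ is the encoder and $d_n : [M] \to \mathsf{Y}^n$ is the decoder, and $[M] = \{1, 2, \ldots, M\}$. We will denote $Y^n = d_n(e_n(X^n))$.

Definition 6. Given a source $P_X$, a coordination $P_{Y|X}$, and a distortion $\Delta$, let $\mathcal{E}(\Delta, P_{Y|X})$ denote the set of all $Q \in \mathcal{P}(\mathsf{X} \times \mathsf{Y})$, such that

$$Q_X = P_X \quad \text{and} \quad \|Q - P_X \otimes P_{Y|X}\|_{\mathcal{F}} \le \Delta. \tag{56}$$

Define the rate-distortion/coordination function of $P_X$:

$$R(\Delta, P_{Y|X}) = \inf_{Q \in \mathcal{E}(\Delta, P_{Y|X})} I(Q), \tag{57}$$

where $I(Q)$ denotes the mutual information between $X$ and $Y$ when they have joint law $Q$.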
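For finite alphabets and a finite function class, the quantities in Definition 6 can be evaluated directly. The following is a minimal sketch under those assumptions (the general setting of the paper involves suprema over infinite classes); the names `seminorm`, `mutual_information`, `is_feasible`, and the toy laws below are hypothetical and chosen only for illustration.

```python
import math


def seminorm(q, p, function_class):
    """||Q - P||_F = max over f in F of |Q(f) - P(f)| for pmfs stored as dicts."""
    support = set(q) | set(p)
    return max(
        abs(sum((q.get(z, 0.0) - p.get(z, 0.0)) * f(*z) for z in support))
        for f in function_class
    )


def mutual_information(q):
    """I(Q) in bits for a joint pmf q[(x, y)]."""
    qx, qy = {}, {}
    for (x, y), v in q.items():
        qx[x] = qx.get(x, 0.0) + v
        qy[y] = qy.get(y, 0.0) + v
    return sum(v * math.log2(v / (qx[x] * qy[y]))
               for (x, y), v in q.items() if v > 0)


def is_feasible(q, p_x, p_target, function_class, delta, tol=1e-12):
    """Check the two conditions in (56): Q_X = P_X and ||Q - P_X (x) P_{Y|X}||_F <= Delta."""
    qx = {}
    for (x, _), v in q.items():
        qx[x] = qx.get(x, 0.0) + v
    same_marginal = all(abs(qx.get(x, 0.0) - p_x.get(x, 0.0)) <= tol
                        for x in set(qx) | set(p_x))
    return same_marginal and seminorm(q, p_target, function_class) <= delta + tol


def agree(x, y):
    """A single [0, 1]-valued test function: 1 when the two actions coincide."""
    return 1.0 if x == y else 0.0


# Toy example: uniform binary source, desired coordination Y = X.
p_x = {0: 0.5, 1: 0.5}
p_target = {(0, 0): 0.5, (1, 1): 0.5}
q_indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

distance = seminorm(q_indep, p_target, [agree])
rate_indep = mutual_information(q_indep)
rate_target = mutual_information(p_target)
```

In this toy example the independent law is at distance 0.5 from the target, so for $\Delta \ge 0.5$ it belongs to the feasible set and carries zero mutual information, whereas the target law itself costs one bit.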

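Theorem 1. Let $P_{Y|X}$ be a given coordination and $\Delta$ a given distortion level.

a) Direct part: If $\mathcal{F}$ is a GC class and $R(\Delta, P_{Y|X}) < \infty$, then for any $\varepsilon > 0$ there exist $n = n(\varepsilon)$ and an $(n, 2^{nR})$-code $(e_n, d_n)$ with $R < R(\Delta, P_{Y|X}) + \varepsilon$ satisfying

$$\mathbb{E}\big\|\mathsf{P}_{(X^n, Y^n)} - P_X \otimes P_{Y|X}\big\|_{\mathcal{F}} \le \Delta + \varepsilon. \tag{58}$$

b) Converse part: Suppose that there exists an $(n, 2^{nR})$-code $Y^n(X^n) = d_n(e_n(X^n))$, satisfying

$$\mathbb{E}\big\|\mathsf{P}_{(X^n, Y^n)} - P_X \otimes P_{Y|X}\big\|_{\mathcal{F}} \le \Delta. \tag{59}$$

Then $R \ge R(\Delta, P_{Y|X})$.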

Remark 6. Note that the converse does not require $\mathcal{F}$ to be GC. However, it must be sufficiently "well-behaved" for $\|\mathsf{P}_{(X^n, Y^n)} - P_X \otimes P_{Y|X}\|_{\mathcal{F}}$ to be measurable for any choice of a (measurable) encoder-decoder pair.

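Proof (direct part): To prove the direct part, fix $(\Delta, P_{Y|X})$ and pick any $Q \in \mathcal{E}(\Delta, P_{Y|X})$ such that $I(Q) < R(\Delta, P_{Y|X}) + \varepsilon/2$. Let $X \in \mathsf{X}$ and $U \in \mathsf{Y}$ have joint law $Q$. Then $X \to X \to U$ is a Markov chain, and Lemma 1 guarantees the existence of an $n$ and a mapping $\Phi_n : \mathsf{X}^n \to \mathsf{Y}^n$, such that

\begin{align}
\frac{1}{n} \log \big|\{\Phi_n(X^n)\}\big| &\le I(Q) + \varepsilon/2 \tag{60} \\
&< R(\Delta, P_{Y|X}) + \varepsilon \tag{61}
\end{align}

and

$$\mathbb{E}\big\|\mathsf{P}_{(X^n, \Phi_n(X^n))} - Q\big\|_{\mathcal{F}} < \varepsilon. \tag{62}$$

Let $Y^n = \Phi_n(X^n)$. Then the triangle inequality gives

\begin{align}
\mathbb{E}\big\|\mathsf{P}_{(X^n, Y^n)} - P_X \otimes P_{Y|X}\big\|_{\mathcal{F}} &\le \mathbb{E}\big\|\mathsf{P}_{(X^n, Y^n)} - Q\big\|_{\mathcal{F}} + \big\|Q - P_X \otimes P_{Y|X}\big\|_{\mathcal{F}} \tag{63} \\
&\le \Delta + \varepsilon, \tag{64}
\end{align}

which establishes (58). ■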
Proof (converse part): For the converse, we will use the time mixing technique (cf. [3] and Appendix B). Let $Y^n(X^n)$ be an $(n, 2^{nR})$-code such that (59) holds. Let $T$ be a random variable uniformly distributed over the set $[n]$, independently of $X^n$, and let $Q$ denote the joint distribution of $(X_T, Y_T)$. Then

\begin{align}
nR &\ge H(Y^n(X^n)) \tag{65} \\
&= H(Y^n(X^n)) - H(Y^n(X^n) \mid X^n) \tag{66} \\
&= I(X^n; Y^n(X^n)) \tag{67} \\
&\ge \sum_{t=1}^n I(X_t; Y_t) \tag{68} \\
&= n I(X_T; Y_T \mid T) \tag{69} \\
&= n I(X_T; Y_T, T) \tag{70} \\
&\ge n I(X_T; Y_T) \tag{71} \\
&= n I(Q), \tag{72}
\end{align}

where:

• (65) holds because the log-cardinality of the range of $Y^n(\cdot)$ is bounded by $nR$

• (68) is a standard information-theoretic fact: if $X^n$ is an i.i.d. tuple, then for any sequence $Y_1, \ldots, Y_n$ jointly distributed with $X^n$

$$I(X^n; Y^n) \ge \sum_{t=1}^n I(X_t; Y_t) \tag{73}$$

• (69) follows from the construction of $T$

• (70) holds because, by the chain rule for mutual information,

$$I(X_T; Y_T, T) = I(X_T; T) + I(X_T; Y_T \mid T),$$

where the first term on the r.h.s. is zero because $X^n$ is i.i.d. (see Fact 1 in Appendix B).

The remaining steps are consequences of other definitions and standard information-theoretic identities.
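Steps (69)-(71) can be illustrated on a small finite example. The sketch below assumes a uniform binary source and two hypothetical per-time channels (noiseless at $t = 1$, bit-flipping at $t = 2$); it computes $I(X_T; Y_T \mid T)$ as the average of the per-time mutual informations and compares it with $I(X_T; Y_T)$ under the time-mixed law. All names are illustrative and not part of the original development.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint pmf joint[(x, y)]."""
    px, py = {}, {}
    for (x, y), v in joint.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(v * math.log2(v / (px[x] * py[y]))
               for (x, y), v in joint.items() if v > 0)


def channel_joint(p_x, channel):
    """Joint law of (X_t, Y_t) when Y_t is produced from X_t by channel[x][y]."""
    return {(x, y): p_x[x] * channel[x][y] for x in p_x for y in channel[x]}


p_x = {0: 0.5, 1: 0.5}
# Hypothetical per-time channels: noiseless at t = 1, bit-flipping at t = 2.
channels = [
    {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
    {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}},
]
per_time = [channel_joint(p_x, ch) for ch in channels]
n = len(per_time)

# I(X_T; Y_T | T) is the average of I(X_t; Y_t), as in step (69).
conditional_mi = sum(mutual_information(j) for j in per_time) / n

# Law of (X_T, Y_T): the uniform mixture over t of the per-time joint laws.
mixed = {}
for joint in per_time:
    for z, v in joint.items():
        mixed[z] = mixed.get(z, 0.0) + v / n

# I(X_T; Y_T) = I(Q), the right-hand side of step (71).
mixed_mi = mutual_information(mixed)
```

Here each time index carries one full bit, yet the time-mixed law is a product law with zero mutual information, so the inequality in (71) can be strict.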

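Since $X^n$ is i.i.d., $X_T$ is independent of $T$ and has the same distribution as $X_1$, namely $P_X$. Moreover, the expected empirical distribution $\mathbb{E}\mathsf{P}_{(X^n, Y^n)}$ is equal to $P_{(X_T, Y_T)} = Q$ (Fact 2 in Appendix B). Thus, we can write

\begin{align}
\|Q - P_X \otimes P_{Y|X}\|_{\mathcal{F}} &= \big\|\mathbb{E}\mathsf{P}_{(X^n, Y^n)} - P_X \otimes P_{Y|X}\big\|_{\mathcal{F}} \tag{74} \\
&\le \mathbb{E}\big\|\mathsf{P}_{(X^n, Y^n)} - P_X \otimes P_{Y|X}\big\|_{\mathcal{F}} \tag{75} \\
&\le \Delta, \tag{76}
\end{align}

where (75) follows from convexity, and (76) from (59). Hence, $Q \in \mathcal{E}(\Delta, P_{Y|X})$, so $R \ge I(Q) \ge R(\Delta, P_{Y|X})$. ■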
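The convexity step can be spelled out as follows. This is a sketch of the standard argument rather than a quotation from the source: it writes $P = P_X \otimes P_{Y|X}$ and assumes that $[\mathbb{E}\mathsf{P}_{(X^n,Y^n)}](f) = \mathbb{E}[\mathsf{P}_{(X^n,Y^n)}(f)]$ for every bounded measurable $f$, together with the measurability required in Remark 6.

```latex
% For each fixed f in F, Jensen's inequality for |.| gives
\begin{align*}
\bigl| [\mathbb{E}\mathsf{P}_{(X^n,Y^n)}](f) - P(f) \bigr|
  &= \bigl| \mathbb{E}\bigl[ \mathsf{P}_{(X^n,Y^n)}(f) - P(f) \bigr] \bigr| \\
  &\le \mathbb{E}\bigl| \mathsf{P}_{(X^n,Y^n)}(f) - P(f) \bigr| \\
  &\le \mathbb{E} \sup_{f' \in \mathcal{F}}
       \bigl| \mathsf{P}_{(X^n,Y^n)}(f') - P(f') \bigr|
   = \mathbb{E}\bigl\| \mathsf{P}_{(X^n,Y^n)} - P \bigr\|_{\mathcal{F}} .
\end{align*}
% The right-hand side does not depend on f, so taking the supremum
% over f in F on the left-hand side yields the bound used in the proof.
```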

B. Communication of empirical processes 

The next application we consider also concerns distributed approximation of an empirical process. We have a joint law $P = P_{XY} \in \mathcal{P}(\mathsf{X} \times \mathsf{Y})$. Let $\{(X_i, Y_i)\}_{i=1}^\infty$ be an infinite sequence of independent draws from $P$. Consider the two-node network shown in Figure 3. Node A (resp., Node B) has perfect observations of $\{X_i\}$ (resp., $\{Y_i\}$). As before, Node A can transmit information to Node B over a rate-limited channel. The goal is for Node A to communicate with Node B at a minimal rate, so that Node B can approximate the desired empirical process to within a given distortion level $\Delta$.
[Figure: Node A observes $X^n$ and sends the message $M = e_n(X^n)$ to Node B, which observes $Y^n$ and produces $\hat{X}^n = d_n(M, Y^n)$.]

Fig. 3. Distributed compression of empirical processes.



More precisely, given a block length $n$ and denoting by $\hat{X}^n$ the reconstruction of $X^n$ at Node B, we wish to guarantee that

$$\mathbb{E}\big\|\mathsf{P}_{(\hat{X}^n, Y^n)} - P\big\|_{\mathcal{F}} \le \Delta. \tag{77}$$

This setting is a generalization of the problem of communication of probability distributions, recently formulated and studied by Kramer and Savari [15] in the finite-alphabet setting. Here, we allow general alphabets and decoder side information. As we will see, the minimum achievable rate admits a single-letter characterization reminiscent of the Wyner-Ziv rate-distortion function for lossy source coding with decoder side information [22], [23].

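Definition 7. An $(n, M)$-code is a pair $(e_n, d_n)$, where $e_n : \mathsf{X}^n \to [M]$ is the encoder and $d_n : [M] \times \mathsf{Y}^n \to \mathsf{X}^n$ is the decoder. We will denote $\hat{X}^n = d_n(e_n(X^n), Y^n)$.

Definition 8. Given a source $P_{XY} \in \mathcal{P}(\mathsf{X} \times \mathsf{Y})$, let $\mathcal{E}(\Delta)$ denote the set of all $Q \in \mathcal{P}(\mathsf{X} \times \mathsf{Y} \times \mathsf{U})$, where $\mathsf{U}$ is a standard Borel space, such that:

1) $Q_{XY} = P_{XY}$

2) $Q_{U|XY} = Q_{U|X}$ (i.e., $Y \to X \to U$ is a Markov chain)

3) There is a function $g : \mathsf{Y} \times \mathsf{U} \to \mathsf{X}$, such that

$$\|Q_{WY} - P\|_{\mathcal{F}} \le \Delta, \tag{78}$$

where $W = g(Y, U)$.

With this, define the rate-distortion function

$$R(\Delta) \triangleq \inf_{Q \in \mathcal{E}(\Delta)} \big[ I(Q_{XU}) - I(Q_{YU}) \big]. \tag{79}$$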
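Theorem 2. Let $\mathcal{F}$ be a class of functions $f : \mathsf{X} \times \mathsf{Y} \to [0,1]$ and $\Delta$ a nonnegative distortion level.

a) Direct part: Suppose that $\mathcal{F}$ is a GC class, and that for any $\delta > 0$, $\mu \in \mathcal{P}(\mathsf{X} \times \mathsf{Y})$ one can find a finite set $\{\hat{x}_j\}_{j=1}^N \subset \mathsf{X}$ and a quantizer $q : \mathsf{X} \to \{\hat{x}_j\}$, such that

$$\|\mu_{q(X)Y} - \mu\|_{\mathcal{F}} \le \delta. \tag{80}$$

If $R(\Delta) < \infty$, then for any $\varepsilon > 0$ there exist an $n = n(\varepsilon)$ and an $(n, 2^{nR})$-code with $R < R(\Delta) + \varepsilon$ satisfying

$$\mathbb{E}\big\|\mathsf{P}_{(\hat{X}^n, Y^n)} - P\big\|_{\mathcal{F}} \le \Delta + \varepsilon, \tag{81}$$

where $\hat{X}^n = d_n(e_n(X^n), Y^n)$.

b) Converse part: Suppose that there exists an $(n, 2^{nR})$-code $\hat{X}^n = d_n(e_n(X^n), Y^n)$ satisfying

$$\mathbb{E}\big\|\mathsf{P}_{(\hat{X}^n, Y^n)} - P\big\|_{\mathcal{F}} \le \Delta. \tag{82}$$

Then $R \ge R(\Delta)$.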

Remark 7. The quantization assumption (80) is a "smoothness" condition on $\mathcal{F}$, and is akin to an assumption made by Wyner in [23] in order to extend the achievability part of the finite-alphabet result of [22] to abstract alphabets.

Proof (direct part): First we show that, owing to the quantization assumption (80), we can assume w.l.o.g. that both $\mathsf{Y}$ and the auxiliary alphabet $\mathsf{U}$ are finite. This follows from the following lemma, whose proof is given in Appendix C:

Lemma 2. Consider any law $Q \in \mathcal{E}(\Delta)$. Then, for any $\delta > 0$, there exist finite measurable partitions $\{A_i\}_{i=1}^{N_1}$ and $\{B_j\}_{j=1}^{N_2}$ of $\mathsf{Y}$ and $\mathsf{U}$ and a function $g_1 : \mathsf{Y} \times \mathsf{U} \to \mathsf{X}$ such that:

a) $\|Q_{W_1 Y} - P\|_{\mathcal{F}} \le \Delta + \delta$, where $W_1 = g_1(Y, U)$

b) $g_1$ is constant on the rectangles $A_i \times B_j$, $1 \le i \le N_1$, $1 \le j \le N_2$

c) $I(Q_{X\tilde{U}}) - I(Q_{\tilde{Y}\tilde{U}}) \le I(Q_{XU}) - I(Q_{YU}) + \delta$, where $\tilde{Y} = i$ for $Y \in A_i$ and $\tilde{U} = j$ for $U \in B_j$.

Let us therefore assume that $\mathsf{U}$ and $\mathsf{Y}$ are both finite. We will use a Wyner-Ziv style two-step argument [22], [23]: The first step consists of using a long block code that preserves typicality (following Lemma 1), while the second step uses a Slepian-Wolf code [30] to communicate the codewords with negligible probability of error. Pick any $Q \in \mathcal{E}(\Delta)$ such that

$$I(Q_{XU}) - I(Q_{YU}) < R(\Delta) + \varepsilon/2. \tag{83}$$

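Define $\tilde{g} : \mathsf{Y} \times \mathsf{U} \to \mathsf{X} \times \mathsf{Y}$ by $\tilde{g}(y, u) = (g(y,u), y)$ and consider the function class $\mathcal{F} \circ \tilde{g} \subset M^{b,1}(\mathsf{Y} \times \mathsf{U})$. Since $\mathcal{F}$ is a GC class, so is $\mathcal{F} \circ \tilde{g}$. To see this, fix any $\mu \in \mathcal{P}(\mathsf{Y} \times \mathsf{U})$ and let $\{(Y_i, U_i)\}_{i=1}^\infty$ be a sequence of i.i.d. draws from $\mu$. Then for any $n$ we can write

\begin{align}
&\big\|\mathsf{P}_{(Y^n, U^n)} - \mu\big\|_{\mathcal{F} \circ \tilde{g}} \tag{84} \\
&\qquad = \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(g(Y_i, U_i), Y_i) - \mathbb{E} f(g(Y, U), Y) \right| \tag{85} \\
&\qquad = \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^n f(W_i, Y_i) - \mathbb{E} f(W, Y) \right|, \tag{86}
\end{align}

where $W = g(Y,U)$ and $W_i = g(Y_i, U_i)$. Thus, the GC property of $\mathcal{F} \circ \tilde{g}$ follows from the GC property of $\mathcal{F}$.$^4$ In view of this, we can apply Lemma 1 to the Markov chain $Y \to X \to U$ and to the GC class $\mathcal{F} \circ \tilde{g}$ to derive the existence of a large enough $n_1$ and a mapping $\Phi_{n_1} : \mathsf{X}^{n_1} \to \mathsf{U}^{n_1}$, such that

$$\frac{1}{n_1} \log \big|\{\Phi_{n_1}(X^{n_1})\}\big| \le I(Q_{XU}) + \varepsilon/2 \tag{87}$$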

$^4$ By contrast, in order for the GC property to be preserved under left compositions, i.e., for $\psi \circ \mathcal{F}$ to be a GC class for some $\psi : [0,1] \to [0,1]$, additional requirements must be imposed on $\psi$ (such as monotonicity or Lipschitz continuity).
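and

\begin{align}
\mathbb{E}\big\|\mathsf{P}_{(Y^{n_1}, U^{n_1})} - Q_{YU}\big\|_{\mathcal{F} \circ \tilde{g}} &= \mathbb{E}\big\|\mathsf{P}_{(W^{n_1}, Y^{n_1})} - Q_{WY}\big\|_{\mathcal{F}} \tag{88} \\
&< \varepsilon/2, \tag{89}
\end{align}

where $U^{n_1} = \Phi_{n_1}(X^{n_1})$ and

$$W^{n_1} = \big(g(Y_1, U_1), \ldots, g(Y_{n_1}, U_{n_1})\big). \tag{90}$$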

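We can use a blocking argument along the lines of Lemmas 3 and 5 of Wyner and Ziv [22] to show that a sufficiently long sequence $U^{n_1}(1), \ldots, U^{n_1}(n_2)$ of i.i.d. realizations of $U^{n_1}$ can be losslessly encoded, using a Slepian-Wolf code, at a rate of

\begin{align}
\frac{1}{n_1} H(U^{n_1} \mid Y^{n_1}) &\le I(Q_{XU}) - I(Q_{YU}) + \varepsilon/2 \tag{91} \\
&< R(\Delta) + \varepsilon. \tag{92}
\end{align}

Let $n = n_1 n_2$, and let $\{\hat{U}_i\}_{i=1}^n$ denote the resulting decoding. Then, if $n_2$ is large enough, we can guarantee that

$$\mathbb{E}\big\|\mathsf{P}_{(Y^n, \hat{U}^n)} - \mathsf{P}_{(Y^n, U^n)}\big\|_{\mathcal{F} \circ \tilde{g}} \le \varepsilon/2, \tag{93}$$

and therefore, with $\hat{X}^n = (g(Y_1, \hat{U}_1), \ldots, g(Y_n, \hat{U}_n))$, that

\begin{align}
\mathbb{E}\big\|\mathsf{P}_{(\hat{X}^n, Y^n)} - Q_{WY}\big\|_{\mathcal{F}} &= \mathbb{E}\big\|\mathsf{P}_{(Y^n, \hat{U}^n)} - Q_{YU}\big\|_{\mathcal{F} \circ \tilde{g}} \tag{94} \\
&\le \mathbb{E}\big\|\mathsf{P}_{(Y^n, \hat{U}^n)} - \mathsf{P}_{(Y^n, U^n)}\big\|_{\mathcal{F} \circ \tilde{g}} + \mathbb{E}\big\|\mathsf{P}_{(Y^n, U^n)} - Q_{YU}\big\|_{\mathcal{F} \circ \tilde{g}} \tag{95} \\
&\le \varepsilon. \tag{96}
\end{align}

The triangle inequality then yields

\begin{align}
\mathbb{E}\big\|\mathsf{P}_{(\hat{X}^n, Y^n)} - P\big\|_{\mathcal{F}} &\le \mathbb{E}\big\|\mathsf{P}_{(\hat{X}^n, Y^n)} - Q_{WY}\big\|_{\mathcal{F}} + \|Q_{WY} - P\|_{\mathcal{F}} \tag{97} \\
&\le \Delta + \varepsilon. \tag{98}
\end{align}

This gives an $(n, 2^{nR})$-code with $R < R(\Delta) + \varepsilon$. ■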
Proof (converse part): To prove the converse, we again use time mixing. Let $(e_n, d_n)$ be an $(n, 2^{nR})$-code, let $J = e_n(X^n)$ and $\hat{X}^n = d_n(J, Y^n)$, and let $T$ be uniformly distributed on $[n]$ independently of $(X^n, Y^n)$. Define an auxiliary random variable

$$U = (J, X^{T-1}, Y^{T-1}, Y_{T+1}^n, T) \tag{99}$$

(cf. [3], [22], [23]) and note that $Y_T \to X_T \to U$ is a Markov chain. Moreover,

\begin{align}
nR &\ge H(J) \tag{100} \\
&\ge H(J \mid Y^n) \tag{101} \\
&= I(X^n; J \mid Y^n) \tag{102} \\
&= \sum_{t=1}^n I(X_t; J \mid Y^n, X^{t-1}) \tag{103} \\
&= \sum_{t=1}^n I(X_t; J, X^{t-1}, Y^{t-1}, Y_{t+1}^n \mid Y_t) \tag{104} \\
&= n I(X_T; J, X^{T-1}, Y^{T-1}, Y_{T+1}^n \mid Y_T, T) \tag{105} \\
&= n I(X_T; J, X^{T-1}, Y^{T-1}, Y_{T+1}^n, T \mid Y_T) \tag{106} \\
&= n I(X_T; U \mid Y_T), \tag{107}
\end{align}

where:

• (100) holds because the log-cardinality of the range of $e_n(\cdot)$ is bounded by $nR$

• (104) follows from the chain rule and the fact that $X_t \to Y_t \to (X^{t-1}, Y^{t-1}, Y_{t+1}^n)$ is a Markov chain

• (105) follows from the construction of $T$

• (106) follows because, by the chain rule,

\begin{align*}
&I(X_T; J, X^{T-1}, Y^{T-1}, Y_{T+1}^n, T \mid Y_T) \\
&\qquad = I(X_T; T \mid Y_T) + I(X_T; J, X^{T-1}, Y^{T-1}, Y_{T+1}^n \mid Y_T, T),
\end{align*}

where the first term on the r.h.s. is zero because $(X_1, Y_1), \ldots, (X_n, Y_n)$ are i.i.d., so $(X_T, Y_T)$ is independent of $T$ (see Fact 1 in Appendix B).

The remaining steps are consequences of other definitions and standard information-theoretic identities.

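Since $\{(X_i, Y_i)\}_{i=1}^n$ are i.i.d., $(X_T, Y_T)$ has the same joint law as $(X_1, Y_1)$, namely $P_{XY}$. Moreover, $\hat{X}_T$ is a deterministic function of $(Y_T, U)$, and $\mathbb{E}\mathsf{P}_{(\hat{X}^n, Y^n)} = P_{(\hat{X}_T, Y_T)}$. Finally,

\begin{align}
\big\|P_{(\hat{X}_T, Y_T)} - P\big\|_{\mathcal{F}} &= \big\|\mathbb{E}\mathsf{P}_{(\hat{X}^n, Y^n)} - P\big\|_{\mathcal{F}} \tag{108} \\
&\le \mathbb{E}\big\|\mathsf{P}_{(\hat{X}^n, Y^n)} - P\big\|_{\mathcal{F}} \tag{109} \\
&\le \Delta, \tag{110}
\end{align}

where (109) follows from convexity, and (110) follows from (82). Hence, the joint law of $X_T$, $Y_T$, and $U$ belongs to $\mathcal{E}(\Delta)$, which means that $R \ge I(X_T; U \mid Y_T) \ge R(\Delta)$. ■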
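The last inequality compares a conditional mutual information with the difference of mutual informations appearing in (79). The link is the following standard identity, sketched here under the assumption that $I(Y;U)$ is finite; it is a textbook consequence of the chain rule and is not quoted from the source.

```latex
% Let Y -> X -> U be a Markov chain, so that I(Y; U | X) = 0.
% Expanding I(X, Y; U) by the chain rule in two different orders:
\begin{align*}
I(X, Y; U) &= I(X; U) + I(Y; U \mid X) = I(X; U), \\
I(X, Y; U) &= I(Y; U) + I(X; U \mid Y).
\end{align*}
% Equating the two expressions gives
\begin{equation*}
I(X; U \mid Y) = I(X; U) - I(Y; U),
\end{equation*}
% which, for the joint law Q of (X_T, Y_T, U), is the quantity
% I(Q_{XU}) - I(Q_{YU}) minimized in the definition of R(\Delta).
```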

C. Lossy coding with respect to a class of distortion measures 

Finally, we consider the problem of lossy coding with respect to a class of distortion measures (fidelity criteria). For general (Polish) alphabets, it was solved by Dembo and Weissman [14], but the finite-alphabet variant appears already as Problem 14 in [31]. Let $\mathsf{X}$ and $\mathsf{Y}$ denote the source and the reproduction alphabets, respectively. Suppose a class $\Gamma$ of distortion measures $\rho : \mathsf{X} \times \mathsf{Y} \to [0,1]$ is given, together with a class of nonnegative reals indexed by $\rho \in \Gamma$, $\{\Delta_\rho\}_{\rho \in \Gamma}$. The goal is to find a block code of minimal rate whose expected distortion under each $\rho \in \Gamma$ is bounded by the corresponding $\Delta_\rho$. We use the same definition of an $(n, M)$-code as in Section V-A.

Define a mapping $F(\cdot, \{\Delta_\rho\}) : \mathcal{P}(\mathsf{X} \times \mathsf{Y}) \to \mathbb{R}$ by

$$F(Q, \{\Delta_\rho\}) = \sup_{\rho \in \Gamma} \big[ Q(\rho) - \Delta_\rho \big], \tag{111}$$

where

$$Q(\rho) = \int \rho \, dQ = \int \rho(x,y) \, Q(dx, dy) \tag{112}$$

is the expected distortion between $X$ and $Y$ when they have joint law $Q$.

Definition 9. Given a source $P_X \in \mathcal{P}(\mathsf{X})$, let $\mathcal{E}(\{\Delta_\rho\})$ denote the set of all $Q \in \mathcal{P}(\mathsf{X} \times \mathsf{Y})$ such that

$$Q_X = P_X \quad \text{and} \quad F(Q, \{\Delta_\rho\}) \le 0. \tag{113}$$

Define the rate-distortion function

$$R(\{\Delta_\rho\}) = \inf_{Q \in \mathcal{E}(\{\Delta_\rho\})} I(Q). \tag{114}$$
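For a finite alphabet and a finite class of distortion measures, the functional in (111) reduces to a maximum over finitely many expected distortions. The following is a minimal sketch under those assumptions; the two distortion measures, the levels, and the joint law are hypothetical values chosen only to illustrate the feasibility condition (113).

```python
def expected_distortion(q, rho):
    """Q(rho): expected value of rho(x, y) under the joint pmf q[(x, y)]."""
    return sum(v * rho(x, y) for (x, y), v in q.items())


def constraint_gap(q, distortions, levels):
    """F(Q, {Delta_rho}) = max over rho of [Q(rho) - Delta_rho] for a finite class."""
    return max(expected_distortion(q, rho) - levels[name]
               for name, rho in distortions.items())


def hamming(x, y):
    """Hamming distortion."""
    return 0.0 if x == y else 1.0


def false_alarm(x, y):
    """Penalizes reproducing 1 when the source symbol is 0."""
    return 1.0 if (x == 0 and y == 1) else 0.0


distortions = {"hamming": hamming, "false_alarm": false_alarm}
levels = {"hamming": 0.25, "false_alarm": 0.125}

# A hypothetical joint law on {0, 1} x {0, 1} with uniform X-marginal.
q = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}

gap = constraint_gap(q, distortions, levels)

# Tightening the Hamming level makes the same law infeasible.
tight_levels = {"hamming": 0.125, "false_alarm": 0.125}
tight_gap = constraint_gap(q, distortions, tight_levels)
```

A law satisfies the distortion part of (113) exactly when the gap is at most zero, i.e., when every distortion measure in the class meets its own level simultaneously.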
Theorem 1 of [14] shows that any rate $R > R(\{\Delta_\rho\})$ is achievable, provided the mapping $Q \mapsto F(Q, \{\Delta_\rho\})$ is upper semicontinuous (u.s.c.) under the weak topology on $\mathcal{P}(\mathsf{X} \times \mathsf{Y})$. Moreover, no rate $R < R(\{\Delta_\rho\})$ is achievable. We now show that the u.s.c. requirement can be replaced by a GC condition:

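Theorem 3. Let $\Gamma$ be a class of distortion measures and $\{\Delta_\rho\}_{\rho \in \Gamma}$ a class of nonnegative distortion levels.

a) Direct part: If $\Gamma$ is a GC class and $R(\{\Delta_\rho\}) < \infty$, then for any $\varepsilon > 0$, there exist an $n = n(\varepsilon)$ and an $(n, 2^{nR})$-code with $R < R(\{\Delta_\rho\}) + \varepsilon$ satisfying

$$\mathbb{E} \sup_{\rho \in \Gamma} \big[ \rho(X^n, Y^n) - \Delta_\rho \big] \le \varepsilon, \tag{115}$$

where $\rho(X^n, Y^n) \triangleq \mathsf{P}_{(X^n, Y^n)}(\rho)$.

b) Converse part: Suppose that there exists an $(n, 2^{nR})$-code $Y^n = d_n(e_n(X^n))$ satisfying

$$\mathbb{E} \rho(X^n, Y^n) \le \Delta_\rho, \qquad \forall \rho \in \Gamma. \tag{116}$$

Then $R \ge R(\{\Delta_\rho\})$.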

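Proof: To prove the direct part, pick any $Q \in \mathcal{E}(\{\Delta_\rho\})$ such that $I(Q) < R(\{\Delta_\rho\}) + \varepsilon/2$. Let $X \in \mathsf{X}$ and $U \in \mathsf{Y}$ have joint law $Q$. The same argument as in the proof of Theorem 1 can be used to show the existence of a large enough $n$ and a mapping $\Phi_n : \mathsf{X}^n \to \mathsf{Y}^n$, such that

\begin{align}
\frac{1}{n} \log \big|\{\Phi_n(X^n)\}\big| &\le I(Q) + \varepsilon/2 \tag{117} \\
&< R(\{\Delta_\rho\}) + \varepsilon \tag{118}
\end{align}

and

$$\mathbb{E}\big\|\mathsf{P}_{(X^n, Y^n)} - Q\big\|_{\Gamma} < \varepsilon, \tag{119}$$

where $Y^n = \Phi_n(X^n)$. Now, for any $\rho \in \Gamma$ we have

$$\rho(X^n, Y^n) - \Delta_\rho \le \big\|\mathsf{P}_{(X^n, Y^n)} - Q\big\|_{\Gamma} + F(Q, \{\Delta_\rho\}). \tag{120}$$

Consequently, taking the supremum of both sides over $\Gamma$ and then the expectation w.r.t. $X^n$, and using $F(Q, \{\Delta_\rho\}) \le 0$, we get (115).

The proof of the converse is exactly the same as in [14]. ■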

VI. Conclusion 

We have proposed a new definition of typical sequences 
over a wide class of abstract alphabets (standard Borel spaces), 
which retains many useful properties of strong (total-variation) 
typicality for finite alphabets. In particular, it is preserved in a 
Markov structure, which has allowed us to develop transparent 
achievability proofs in several settings pertaining to empirical 
coordination of actions in a two-node network using finite 
communication resources. Here are some directions for future 
research: 

• Behavior in the finite block length regime: GC classes with sufficiently "regular" metric or combinatorial structure admit sharp concentration-of-measure inequalities of the form

$$\mathbb{P}\big( \|\mathsf{P}_{Z^n} - P\|_{\mathcal{F}} > \varepsilon \big) \le S(n; \mathcal{F}) \, e^{-Cn\varepsilon^2}, \tag{121}$$

where $C > 0$ is some constant and $S(n; \mathcal{F})$ is a function of "moderate" growth in $n$, which typically depends on the geometric characteristics of $\mathcal{F}$ [9]-[11]. For example, if $\mathcal{F}$ is a VC class, then $S(n; \mathcal{F}) = O(n^{V(\mathcal{F})})$; in the latter case, we also have

$$\mathbb{E}\|\mathsf{P}_{Z^n} - P\|_{\mathcal{F}} \le C \sqrt{\frac{V(\mathcal{F})}{n}}, \tag{122}$$

where $C > 0$ is a universal constant. These inequalities can be used to investigate the behavior of our coding schemes in the finite block length regime (e.g., the rate of convergence of the achievable $\|\cdot\|_{\mathcal{F}}$-distortion to the optimum).

• Extension to stationary ergodic sources: Recently, Adams and Nobel [32] have shown that the ULLN holds for countable (or separable) classes of VC sets and functions even when the underlying process is stationary and ergodic (rather than i.i.d.), although without any specific guarantees on the rate of convergence. Their work opens the possibility of extending our GC typicality approach to stationary ergodic sources via sliding block codes [33]-[35].

• Connections to simulation of information sources: The operational criteria used in our treatment of empirical coordination suggest new ways of thinking about simulation of random processes and related problems in rate-distortion coding [3], [36]-[38]. Many problems related to sensing, learning, and control under communication constraints can be reduced (or related) to simulation of random processes, and our formalism may be of use for characterizing the fundamental information-theoretic limits in these settings.
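The $n^{-1/2}$ scaling mentioned in the first direction above can be observed in a small simulation. The sketch below assumes the simplest VC class, namely indicators of half-lines $(-\infty, t]$ on the real line, and a Uniform$[0,1]$ source, for which the seminorm of the empirical deviation is the Kolmogorov-Smirnov statistic; the function names, sample sizes, and seed are arbitrary illustrative choices.

```python
import random


def sup_deviation_halflines(sample):
    """sup over t of |F_n(t) - t| for a sample assumed drawn from Uniform[0, 1].

    This equals ||P_{Z^n} - P||_F when F is the class of half-line indicators.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def mean_sup_deviation(n, trials, rng):
    """Monte Carlo estimate of E ||P_{Z^n} - P||_F at block length n."""
    total = 0.0
    for _ in range(trials):
        total += sup_deviation_halflines([rng.random() for _ in range(n)])
    return total / trials


rng = random.Random(0)
small_n_estimate = mean_sup_deviation(100, 200, rng)
large_n_estimate = mean_sup_deviation(1600, 200, rng)

# If the deviation scales like n^(-1/2), multiplying n by 16 should shrink it
# by a factor of roughly 4.
ratio = small_n_estimate / large_n_estimate
```

The observed ratio only indicates the order of decay; it says nothing about the constants in (121) or (122), which depend on the class.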

Appendix A 
Piggyback Coding Lemma for Borel spaces 

In this appendix we prove the following lemma, which is 
an extension of the Piggyback Coding lemma of Wyner [21, 
Lemma 4.3] to general alphabets: 

Lemma A.1. Let $\mathsf{U}, \mathsf{V}, \mathsf{W}$ be standard Borel spaces, and let $(U, V, W) \in \mathsf{U} \times \mathsf{V} \times \mathsf{W}$ be a triple of random variables with joint law $P_{UVW}$, such that $U \to V \to W$ is a Markov chain and the mutual information $I(V;W)$ is finite. Let $\{(U_i, V_i, W_i)\}_{i=1}^\infty$ be a sequence of i.i.d. draws from $P_{UVW}$. Let $\{\psi_n\}_{n=1}^\infty$ be a sequence of measurable functions $\psi_n : \mathsf{U}^n \times \mathsf{W}^n \to [0,1]$, such that

$$\lim_{n \to \infty} \mathbb{E}\,\psi_n(U^n, W^n) = 0. \tag{A.1}$$

For a given $\varepsilon > 0$, there exists $n_0 = n_0(\varepsilon)$, such that for every $n \ge n_0$ we can find a mapping $F_n : \mathsf{V}^n \to \mathsf{W}^n$ that satisfies

$$\frac{1}{n} \log \big|\{F_n(v^n) : v^n \in \mathsf{V}^n\}\big| \le I(V;W) + \varepsilon \tag{A.2}$$

and

$$\mathbb{E}\,\psi_n(U^n, F_n(V^n)) < \varepsilon. \tag{A.3}$$
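Proof: The proof is very similar to Wyner's proof for finite alphabets [21]. Fix any $n$ and define a function $\phi_n : \mathsf{V}^n \times \mathsf{W}^n \to [0,1]$ by

\begin{align}
\phi_n(v^n, w^n) &\triangleq \mathbb{E}\big[ \psi_n(U^n, W^n) \,\big|\, V^n = v^n, W^n = w^n \big] \tag{A.4} \\
&= \int \psi_n(u^n, w^n) \, P_{U^n | V^n, W^n}(du^n \mid v^n, w^n). \tag{A.5}
\end{align}

Owing to the Markov chain condition, we can write

$$\phi_n(v^n, w^n) = \int \psi_n(u^n, w^n) \, P_{U^n | V^n}(du^n \mid v^n). \tag{A.6}$$

Letting $\delta_n = \mathbb{E}\,\psi_n(U^n, W^n)$, we define the set

$$\mathcal{S}_n \triangleq \big\{ (v^n, w^n) \in \mathsf{V}^n \times \mathsf{W}^n : \phi_n(v^n, w^n) \le \sqrt{\delta_n} \big\}. \tag{A.7}$$

Then by the Markov inequality we have

$$\mathbb{P}\big( (V^n, W^n) \notin \mathcal{S}_n \big) \le \frac{\mathbb{E}\,\phi_n(V^n, W^n)}{\sqrt{\delta_n}} = \sqrt{\delta_n}. \tag{A.8}$$

Consider an arbitrary measurable mapping $G : \mathsf{V}^n \to \{w^n(1), \ldots, w^n(M)\} \subset \mathsf{W}^n$ for some $M < \infty$. Then, defining the set

$$\tilde{\mathcal{S}}_n = \big\{ v^n \in \mathsf{V}^n : (v^n, G(v^n)) \in \mathcal{S}_n \big\}, \tag{A.9}$$

we can write

\begin{align}
&\mathbb{E}\,\psi_n(U^n, G(V^n)) \tag{A.10} \\
&\qquad = \mathbb{E}\big[ \mathbb{E}[\psi_n(U^n, G(V^n)) \mid V^n] \big] \tag{A.11} \\
&\qquad = \mathbb{E}\,\phi_n(V^n, G(V^n)) \tag{A.12} \\
&\qquad \le \mathbb{P}(\tilde{\mathcal{S}}_n^c) + \int_{\tilde{\mathcal{S}}_n} \phi_n(v^n, G(v^n)) \, P_{V^n}(dv^n), \tag{A.13}
\end{align}

where (A.12) is due to (A.6), while (A.13) uses the fact that $0 \le \phi_n(\cdot, \cdot) \le 1$. Moreover,

\begin{align}
\int_{\tilde{\mathcal{S}}_n} \phi_n(v^n, G(v^n)) \, P_{V^n}(dv^n) &= \sum_{m=1}^M \int_{\tilde{\mathcal{S}}_n \cap \{G = w^n(m)\}} \phi_n(v^n, w^n(m)) \, P_{V^n}(dv^n) \tag{A.14} \\
&\le \sqrt{\delta_n}. \tag{A.15}
\end{align}

Hence,

$$\mathbb{E}\,\psi_n(U^n, G(V^n)) \le \mathbb{P}(\tilde{\mathcal{S}}_n^c) + \sqrt{\delta_n}. \tag{A.16}$$

Now we can use Lemma 9.3.1 in [39] to show that, given $\mathcal{S}_n$, $M$, and an arbitrary $R > 0$, there exist a set $\{w^n(1), \ldots, w^n(M)\} \subset \mathsf{W}^n$ and a mapping $G_n : \mathsf{V}^n \to \{w^n(1), \ldots, w^n(M)\}$, such that

$$\mathbb{P}\big( (V^n, G_n(V^n)) \notin \mathcal{S}_n \big) \le \mathbb{P}\big( (V^n, W^n) \notin \mathcal{S}_n \big) + \mathbb{P}\big( i(V^n, W^n) > nR \big) + \exp\big( -M 2^{-Rn} \big), \tag{A.17}$$

where

$$i(v^n, w^n) \triangleq \log \frac{dP_{V^n W^n}}{d(P_{V^n} \otimes P_{W^n})}(v^n, w^n) \tag{A.18}$$

is the information density [5]. Letting $M = 2^{n(I(V;W) + \varepsilon)}$ and $R = I(V;W) + \varepsilon/2$ and using the corresponding mapping $G_n$, we get

$$\mathbb{E}\,\psi_n(U^n, G_n(V^n)) \le 2\sqrt{\delta_n} + \exp\big( -2^{n\varepsilon/2} \big) + \mathbb{P}\big( i(V^n, W^n) > nR \big). \tag{A.19}$$

Since $\mathbb{E}\,\psi_n(U^n, W^n) = \delta_n \to 0$ as $n \to \infty$, the first term goes to zero as $n \to \infty$. The second term likewise goes to 0 since $\varepsilon > 0$. The third term goes to zero owing to the mean ergodic theorem for information densities [5, Theorem 8.5.1]. Choosing $n_0$ large enough so that the right-hand side of the above inequality is less than $\varepsilon$ finishes the proof. ■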
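The arithmetic behind the second term of (A.19) can be checked directly: with $M = 2^{n(I(V;W)+\varepsilon)}$ and $R = I(V;W) + \varepsilon/2$, the product $M 2^{-Rn}$ equals $2^{n\varepsilon/2}$. The following is a minimal sketch of that bookkeeping; the function name and the numerical values are hypothetical, and the mutual information is measured in bits to match the base-2 exponents.

```python
import math


def codebook_terms(n, mutual_info_bits, eps):
    """Return (log2 of M * 2^(-R n), exp(-M * 2^(-R n))) for the choices in the proof.

    M = 2^(n (I + eps)) and R = I + eps / 2, so the exponent equals n * eps / 2.
    """
    log2_m = n * (mutual_info_bits + eps)
    rate = mutual_info_bits + eps / 2
    exponent = log2_m - n * rate
    second_term = math.exp(-(2.0 ** exponent))
    return exponent, second_term


# Hypothetical numbers: n = 8, I(V; W) = 1 bit, eps = 0.5.
exponent, second_term = codebook_terms(8, 1.0, 0.5)

# The term decays doubly exponentially in n for any fixed eps > 0.
_, later_term = codebook_terms(40, 1.0, 0.5)
```

The doubly exponential decay is why this term is negligible compared with the other two terms in (A.19).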

Appendix B 
Time mixing 

Our discussion of the time mixing technique essentially 
follows [3, p. 4200], except that care must be taken due to 
the fact that we are working with general alphabets here. 

Fix a space $\mathsf{U}$. Let $U^n = (U_1, \ldots, U_n)$ be a random $n$-tuple taking values in $\mathsf{U}^n$ according to some law $P_{U^n}$. Let $T$ be a random variable uniformly distributed over the set $[n]$ independently of $U^n$. Consider the random variable $U_T \in \mathsf{U}$, i.e., the value of the $T$th coordinate of $U^n$. We will use two facts pertaining to this construction.

First, we note that $U_T$ and $T$ need not be independent, even though $U^n$ and $T$ are. One exception is when $U^n$ is an i.i.d. tuple:

Fact 1. If $U^n$ is an i.i.d. tuple with common marginal $P_U$, then $U_T$ is independent of $T$ and has the same law as $U_1$, i.e., $P_U$.

Proof: For any $i \in [n]$ and any $A \in \mathcal{B}_{\mathsf{U}}$,

\begin{align}
P_{U_T, T}(A \times \{i\}) &= \mathbb{P}(T = i) \, P_{U_T | T}(A \mid i) \tag{B.1} \\
&= \mathbb{P}(T = i) \, P_{U_i}(A) \tag{B.2} \\
&= \mathbb{P}(T = i) \, P_U(A) \tag{B.3} \\
&= P_T(\{i\}) \, P_U(A). \tag{B.4}
\end{align}

Hence, $P_{U_T | T}(A \mid i) = P_U(A)$, regardless of $i$. ■

Second, let us consider the empirical distribution $\mathsf{P}_{U^n}$. Since $\mathsf{U}$ is a Borel space, $\mathcal{P}(\mathsf{U})$ is a (complete separable) metric space under any metric that metrizes the weak convergence of probability laws, so we can equip it with its Borel $\sigma$-algebra. Then $\mathsf{P}_{U^n}$ is a $\mathcal{P}(\mathsf{U})$-valued random variable, whose expectation $\mathbb{E}\mathsf{P}_{U^n}$ is given by

$$[\mathbb{E}\mathsf{P}_{U^n}](A) \triangleq \frac{1}{n} \sum_{i=1}^n P_{U_i}(A), \qquad \forall A \in \mathcal{B}_{\mathsf{U}}. \tag{B.5}$$

It is not hard to check that $\mathbb{E}\mathsf{P}_{U^n}$ satisfies the Kolmogorov axioms and is itself an element of $\mathcal{P}(\mathsf{U})$. In particular:

Fact 2. Consider the empirical distribution $\mathsf{P}_{U^n}$. Then

$$\mathbb{E}\mathsf{P}_{U^n} = P_{U_T}, \tag{B.6}$$

where $P_{U_T} \in \mathcal{P}(\mathsf{U})$ is the law of $U_T$.
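Proof: For any $A \in \mathcal{B}_{\mathsf{U}}$,

\begin{align}
[\mathbb{E}\mathsf{P}_{U^n}](A) &= \frac{1}{n} \sum_{i=1}^n P_{U_i}(A) \tag{B.7} \\
&= \mathbb{E}\left[ \sum_{i=1}^n \mathbb{P}(T = i) \, 1_{\{U_i \in A\}} \right] \tag{B.8} \\
&= \mathbb{E}\big[ \mathbb{E}[ 1_{\{U_T \in A\}} \mid U^n ] \big] \tag{B.9} \\
&= \mathbb{E}\big[ 1_{\{U_T \in A\}} \big] \tag{B.10} \\
&= P_{U_T}(A). \tag{B.11}
\end{align}

Since $A$ is arbitrary, (B.6) indeed holds. ■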
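Fact 2 does not require the coordinates to be independent or identically distributed, and this can be verified exactly on a small finite example. The sketch below uses exact rational arithmetic; the law of the tuple is a hypothetical example with dependent, non-identically distributed coordinates, and the function names are illustrative only.

```python
from collections import Counter
from fractions import Fraction


def expected_empirical(law_of_tuple):
    """E[P_{U^n}]: average the empirical distribution of each tuple under its law."""
    out = {}
    for u_tuple, prob in law_of_tuple.items():
        n = len(u_tuple)
        for symbol, count in Counter(u_tuple).items():
            out[symbol] = out.get(symbol, Fraction(0)) + prob * Fraction(count, n)
    return out


def law_of_time_mixed_coordinate(law_of_tuple):
    """Law of U_T, with T uniform on [n] and independent of U^n."""
    out = {}
    for u_tuple, prob in law_of_tuple.items():
        n = len(u_tuple)
        for t in range(n):
            # Joint probability of (T = t, U^n = u_tuple) is prob / n.
            out[u_tuple[t]] = out.get(u_tuple[t], Fraction(0)) + prob * Fraction(1, n)
    return out


# Hypothetical law on {a, b}^3 with dependent, non-identical coordinates.
law = {
    ("a", "a", "b"): Fraction(1, 2),
    ("b", "a", "b"): Fraction(1, 4),
    ("b", "b", "b"): Fraction(1, 4),
}

lhs = expected_empirical(law)
rhs = law_of_time_mixed_coordinate(law)
```

The two dictionaries coincide exactly, which is the finite-alphabet content of (B.6); independence of the coordinates matters only for Fact 1.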

Appendix C 
Proof of Lemma 2 

The proof is very similar to the proof of Lemma 5.3 of 
Wyner [23]. In particular, only part (a) requires modification. 
Parts (b) and (c) follow immediately, just as in [23]. 

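Since $Q \in \mathcal{E}(\Delta)$, there exists a function $g : \mathsf{Y} \times \mathsf{U} \to \mathsf{X}$, such that, with $W = g(Y, U)$,

$$\|Q_{WY} - P\|_{\mathcal{F}} \le \Delta. \tag{C.1}$$

Secondly, owing to the smoothness assumption (80), for any $\delta_1 > 0$ one can find a quantizer $q : \mathsf{X} \to \{x_j\}_{j=1}^N \subset \mathsf{X}$, $N < \infty$, such that

$$\|Q_{q(W)Y} - Q_{WY}\|_{\mathcal{F}} \le \delta_1. \tag{C.2}$$

Let $g_0 = q \circ g$, and define the sets

$$C_j = \big\{ (y, u) \in \mathsf{Y} \times \mathsf{U} : g_0(y,u) = x_j \big\}, \qquad 1 \le j \le N. \tag{C.3}$$

Lemma 5.4 in [23] can be used to show that, for an arbitrary $\delta_2 > 0$, there exists a collection of disjoint sets $\{S_j\}_{j=1}^N \subset \mathcal{B}_{\mathsf{Y}} \otimes \mathcal{B}_{\mathsf{U}}$, where each $S_j$ is a finite union of rectangles, and

$$Q_{YU}(S_j \triangle C_j) \le \delta_2, \qquad 1 \le j \le N. \tag{C.4}$$

Now define $g_1 : \mathsf{Y} \times \mathsf{U} \to \mathsf{X}$ by

$$g_1(y,u) = \begin{cases} x_j, & \text{if } (y,u) \in S_j \\ x_1, & \text{if } (y,u) \notin \bigcup_{j=1}^N S_j. \end{cases} \tag{C.5}$$

Define also the set $E = \bigcup_{j=1}^N (C_j \cap S_j)$ and note that $g_1 = g_0$ on $E$. Then

\begin{align}
\mathbb{E}[f(g_1(Y,U), Y)] &= \mathbb{E}[1_E f(g_0(Y,U), Y)] + \mathbb{E}[1_{E^c} f(g_1(Y,U), Y)] \tag{C.6} \\
&\le \mathbb{E}[f(g_0(Y,U), Y)] + Q_{YU}(E^c) \tag{C.7} \\
&= \mathbb{E}[f(q(W), Y)] + Q_{YU}(E^c) \tag{C.8} \\
&\le \mathbb{E}[f(W, Y)] + \delta_1 + Q_{YU}(E^c). \tag{C.9}
\end{align}

Similarly,

\begin{align}
\mathbb{E}[f(W, Y)] &\le \mathbb{E}[f(q(W), Y)] + \delta_1 \tag{C.10} \\
&= \mathbb{E}[1_E f(q(W), Y)] + \mathbb{E}[1_{E^c} f(q(W), Y)] + \delta_1 \tag{C.11} \\
&= \mathbb{E}[1_E f(g_1(Y,U), Y)] + \mathbb{E}[1_{E^c} f(q(W), Y)] + \delta_1 \tag{C.12} \\
&\le \mathbb{E}[f(g_1(Y,U), Y)] + Q_{YU}(E^c) + \delta_1. \tag{C.13}
\end{align}

In both cases we have used the fact that $f$ is bounded between 0 and 1, as well as (C.2). Moreover, using the fact that $\{C_j\}$ is a disjoint partition of $\mathsf{Y} \times \mathsf{U}$, as well as (C.4), we can write

$$Q_{YU}(E^c) \le \sum_{j=1}^N Q_{YU}(S_j \triangle C_j) \le N\delta_2. \tag{C.14}$$

Combining (C.9), (C.13) and (C.14), we get

$$\|Q_{W_1 Y} - Q_{WY}\|_{\mathcal{F}} \le \delta_1 + N\delta_2, \tag{C.15}$$

where $W_1 = g_1(Y, U)$. Now, given $\delta > 0$, first choose $\delta_1 = \delta/2$. This fixes $N = N(\delta)$. Then choose $\delta_2$ so that $N\delta_2 \le \delta/2$. Together with (C.1) and the triangle inequality, this proves part (a); parts (b) and (c) follow exactly as in [23].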